Extending NLP Tools Repositories for the Interaction with Language
Data Resources Repositories
Thierry Declerck
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbruecken
Germany
declerck@dfki.de
Abstract
This short paper presents some mo-
tivations behind the organization of
the ACL/EACL01 “Workshop on Shar-
ing Tools and Resources for Research
and Education”, concentrating on the
possible connection of Tools and Re-
sources repositories. Taking some pa-
pers printed in this volume and the ACL
Natural Language Software Registry as
a basis, we outline some of the steps to
be done on the side of NLP tool reposi-
tories in order to achieve this goal.
1 Introduction
The main goal of the ACL/EACL01 “Workshop
on Sharing Tools and Resources for Research and
Education” is to discuss methods for the improve-
ment and extension of existing repositories. In
this paper we briefly address one of the central
discussion point of the workshop: how to achieve
a close interlinking between NLP tools and NL
resources repositories. We will base this discus-
sion on the ACL Natural Language Software Reg-
istry (see (Declerck et al., 2000)) and some papers
printed in these proceedings (see the list of papers
in the bibliography).
The necessity of having repositories for NLP
tools has already been clearly recognized in the
past, and recently this topic has also been ad-
dressed within the broader context of a confer-
ence on Language Resources (see (Chaudiron et
al., 2000) and (Declerck et al., 2000)). (Chaud-
iron et al., 2000) is essentially concerned with the
question of identifying the NLP supply according
to its different uses, and thus is describing a user-
oriented approach to NLP tools repositories. (De-
clerck et al., 2000) is mainly describing the func-
tionalities of the new version of the ACL Natural
Language Software Registry, also showing how
this version can overcome some of the practical
problems encountered by former repositories (a
summarized presentation of the ACL Registry is
given below in section 2). Both papers are also
discussing the problem of proposing a good tax-
onomy of NLP tools: user oriented versus de-
veloper oriented, top-down versus bottom-up ap-
proach, coarse-grained versus fine-grained classi-
fication and the way those classification strategies
could cooperate. So for sure there is also still a
need for establishing a cooperation between dis-
tinct approaches to NLP tools classification and
their implementation, and a corresponding dis-
cussion is going on.
But since NLP tools are of interest only if they
have language data they can process and trans-
form, and Language Data Resources are only of
interest if there is a clear indication on how they
can be accessed and processed, there is a also
a real need of establishing descriptive links be-
tween the two types of repositories in which tools
on the one side and language data resources on
the other side are included. This will allow people
using a certain tool to easily find the type of lan-
guage data they need. And the other way round:
people having language data can easily find the
type of tools that can produce some added-value
for their data. The successful establishment of
such a connection between these two types of
repositories will probably require as well a par-
tial reorganization of the NLP repositories on the
one hand and the language data repositories on
the other hand in order to maximally respond to
the overall requirement of what at the end will be
an infrastructure
1
for discovering, accessing and
combining language related resources and tools.
This paper is specially addressing some of the
extensions the ACL Registry is undergoing in or-
der to offer a valuable contribution to this infras-
tructure.
2 The ACL Natural Language Software
Registry
The Natural Language Software Registry (NLSR)
is a concise summary of the capabilities and
sources of a large amount of natural language
processing (NLP) software available to the NLP
community.
2
It comprises academic, commercial
and proprietary software with specifications and
terms on which it can be acquired clearly indi-
cated.
The visitor of the NLSR has two types of
access to the information stored in the NLSR:
browsing through the hierarchically organized list
of products (the maximal depth for browsing is
level 3) or by querying for the specifications of
the products as they are listed in the Registry.
This querying functionality is helping the visi-
tor in finding potential relevant software, since he
or she is be able to formulate standard queries,
whereas a menu allows to constrain the search to
certain aspects of the listed products. So it is pos-
sible to query for example for all freely available
morphological analyzer for Spanish running on a
specific platform. Products can be listed in dis-
tinct sections. In order to know in which sections
a product is to be found, the user can submit a
standard query to the Registry Database.
The underlying classification of the actual ver-
sion of the ACL Registry is largely based on the
book (Varile and Zampolli, 1996). But this taxon-
omy will probably have to be further specialized
and extended in order to satisfy the majority of
the visitors of the NLSR. Therefore the classifi-
cation can be enriched by the products submitted
1
As (Bird and Simons, 2001) names it.
2
See http://registry.dfki.de/
and/or by comments made by the visitors, intro-
ducing thus a bottom-up, developer and/or user
oriented classification.
A general goal of the most recent editions of
the NLSR was the simplification of the registra-
tion procedure, providing a short form to be filled
by the customer. We do not request anymore an
exhaustive description of the submitted product,
but concentrate on few points providing a guiding
for the visitor, who will have to consult the home
page of the institutions or authors having submit-
ted their product for getting more detailed infor-
mation. In accordance with this simplification of
the registration procedure, institutes or companies
submitting their NLP products to the ACL Natural
Language Software Registry are required to give
their URL.
3 Extending the ACL Natural Language
Software Registry
The ACL Registry was till recently a closed
world, in the sense that information encoded in
it could be accessed only by browsing or query-
ing within its web page. Obviously there is a
need for getting access to this information with-
out having to activate a web browser. Therefore
it was planned to provide for an XML export,
since XML is the standard for exchanging struc-
tured documents. And this need was getting even
more urgent after the Registry Team was asked
for permission of harvesting the ACL repository
for the purpose of creating a prototype service
provider in the context of an Open Archive Ini-
tiative for Language Resources, which is called
OLAC (Open Language Archives Community)
and described in (Bird and Simons, 2001).
This excellent initiative also requires that the
information provided by tools repositories is not
only universally available but also has to con-
form to certain standards for metadata descrip-
tion. This in order to ensure the interoperability
across all the repositories participating as meta-
data providers in OLAC.
4 XML for Tools Repositories
(Erjavec and V´aradi, 2001) are proposing a very
interesting description of the TELRI-II concerted
action for a tool catalogue specialized for cor-
pus processing tools. This “limitation” in the
coverage of the repository TELRI repository is
allowing the authors to make extensive experi-
ments with various XML specifications and tools
for the building and display of their catalogue.
An experience which should be beneficial for
the more generic ACL Registry, as well as for
other provider of tools repositories (so for ex-
ample national initiatives, like the one described
in (Chaudiron et al., 2000)). The authors also
mention one advantage of the limitation in the
coverage of tools: the presence in the entries
of a pointer to persons or institutions being able
to offer advice on installing and using the soft-
ware. Thus addressing also one point mentioned
in (Bird and Simons, 2001), where 3 main classes
of providers are described: DATA, TOOLS and
ADVICE providers.
But (Erjavec and V´aradi, 2001) are not propos-
ing a discussion on how to integrate in the de-
scription of the tools the particular relation to
a specific corpus. Nevertheless this should be
a common task to be tackled by all providers
of tools repositories. Probably it would be the
best strategy to start with specialized repositories,
where the problems to solve can appear earlier.
5 Metadata for NLP Tools
As we saw above, the sole conformance to stan-
dards (XML) for document description and inter-
change is not enough in the context of OLAC. But
the use of metadata descriptions for tools seems to
make sense not only for such initiatives. (Lavelli
et al., 2001) show the use of metadata descrip-
tion for tools in the context of an infrastructure for
NLP application development. The role of meta-
data there is to specify the “level of analysis ac-
complished by the source processor”. Thus the
metadata descriptions are useful for the commu-
nication between processes within an NLP chain,
and also allow to mark and identify the document
produced by such a process. In any cases, the
use of metadata description for tools (or processes
triggered by those tools) is probably a key-issue in
the modular design of complex NLP environment.
And one can see in the SiSSA approach to
metadata descriptions for NLP processes, maybe
as a side effect, a proposition for sharing anno-
tations for processes and documents (resources)
that can be handled. This might be a starting point
for the systematic connection of the descriptions
of both NLP tools and language resources.
6 Connection with
Metadata-Descriptions for
(Multimedia/Mltimodal) Language
Resources
Catalogue and repositories for Natural Language
data resources have already been working on the
topic of metadata description for their entries (See
for example LDC and ELRA). One can see OLAC
as a natural extension of the LDC, enlarging the
resources catalogue to a real infrastructure for
language resource identification.
From the side of the Language Engineering
there are initiatives for describing standards and
(Calzolari et al., 2001) present such an initia-
tive, the ISLE project, which is the continuation
of the EAGLES initiative. The main objective
of ISLE is to promote “widely agreed and ur-
gently demanded standards and guidelines for in-
frastructural language resources ..., tools that ex-
ploit them and LE products”. The ongoing dis-
cussions within this project are thus important for
the intended extension of NLP tools repositories.
While (Calzolari et al., 2001) concentrate on
the description of the task of the ISLE compu-
tational lexicon working group and address the
topic of metadata for encoding multilingual lex-
ical resources, (Broeder and Wittenburg, 2001)
presents the work of the ISLE Metadata initiative
(IMDI), which is directly relevant for the topic
addressed here. (Broeder and Wittenburg, 2001)
give a good overview of metadata initiatives for
Language Resources and propose a contrastive
description of OLAC and IMDI, where the main
distinction can be seen in the top-down versus
bottom-up approach. The top-down approach fol-
lowed by OLAC allows an easy conformance to
the Dublin Core set, whereas the bottow-up ap-
proach requires the definition of more “narrow
and specialized categorization schemes”.
This distinction is important for the intended
extension of the metadata description for NLP
tools, since the description of the tools will have
to connect to those distinct kinds of categorization
schemes for data resources. We think here that the
ACL Registry can easily be adapted to this situ-
ation since the actual classification of tools is a
layered one, one layer being quite general (clas-
sifying tools wrt broader application types, like
“Written Language”), and the next layer stressing
more the specific technology (for example Infor-
mation Extraction versus Text Alignment).
(Broeder and Wittenburg, 2001) is also propos-
ing a scheme for connecting the descriptions of
tools and resources. They suggest not to include a
listing of tools in the metadata description of the
resources, since this set of tools would be chang-
ing in time. Rather they suggest a detailed de-
scription of the type and the structure of the re-
sources that can be accessed by a “browser” tool,
which on the basis of the detailed metadata de-
scription can select potential tools for handling
the resources. The tools repository would have
to include this kind of information in its metadata
description of the tools.
7 Conclusion
As we could see out of this (not exhaustive) se-
lection of papers submitted to the ACL/EACL01
“Workshop on Sharing Tools and Resources for
Research and Education”, there are a lot of very
interesting and promising, implicit or explicit,
suggestions for the goal of connecting tools and
resources repositories. The ACL Natural Lan-
guage Registry will take these suggestions as the
basis of the further work on providing extensions
to metadata descriptions in order to be as com-
pliant as possible to emerging infrastructures and
standards for language resources.

References
S. Bird and G. Simons. 2001. The OLAC Metadata
Set and Controlled Vocabularies. In This volume.
D. Broeder and P. Wittenburg. 2001. Interaction
of Tools and Metada-Descriptions for Multimedia
Language Resources. In This volume.
N. Calzolari, A. Lenci, and A. Zampolli. 2001. Inter-
national Standards for Multilingual Resource Shar-
ing: The ISLE Computational Lexicon Working
Group. In This volume.
S. Chaudiron, K. Choukri, A. Mance, and V. Mapelli.
2000. For a repository of NLP tools. In LREC 00,
pages 1273–1278.
T. Declerck, A.W. Jachmann, and H. Uszkoreit. 2000.
The new Edition of the Natural Language Software
Registry (an initiative of ACL hosted at DFKI). In
LREC 00, pages 1129–1132. http://registry.dfki.de.
T. Erjavec and T. V ´aradi. 2001. The TELRI tool cata-
logue: structure and prospect. In This volume.
A. Lavelli, F. Pianesi, E. Maci, I. Prodanof, L. Dini,
and G. Mazzini. 2001. SiSSA – An Infrastructure
for NLP Application Development. In This volume.
G.B. Varile and A. Zampolli. 1996. Survey of
the State of the Art in Human Language Tech-
nology. http://www.cse.ogi.edu/CSLU/HLTsurvey/
HLTsurvey.html.
