<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1502"> <Title>Extending NLP Tools Repositories for the Interaction with Language Data Resources Repositories</Title> <Section position="3" start_page="1" end_page="2" type="metho"> <SectionTitle> 2 The ACL Natural Language Software Registry </SectionTitle> <Paragraph position="0"> The Natural Language Software Registry (NLSR) is a concise summary of the capabilities and sources of a large number of natural language processing (NLP) software products available to the NLP community.</Paragraph> <Paragraph position="1"> It comprises academic, commercial and proprietary software, with the specifications and the terms on which it can be acquired clearly indicated. A visitor to the NLSR has two ways of accessing the information stored in it: browsing through the hierarchically organized list of products (the maximal depth for browsing is level 3) or querying for the specifications of the products as they are listed in the Registry. The querying functionality helps the visitor find potentially relevant software, since he or she is able to formulate standard queries, while a menu allows the search to be constrained to certain aspects of the listed products. It is thus possible, for example, to query for all freely available morphological analyzers for Spanish running on a specific platform. Products can be listed in distinct sections. In order to know in which sections a product is to be found, the user can submit a standard query to the Registry Database.</Paragraph> <Paragraph position="2"> The underlying classification of the current version of the ACL Registry is largely based on the book (Varile and Zampolli, 1996). But this taxonomy will probably have to be further specialized and extended in order to satisfy the majority of the visitors of the NLSR. 
Therefore the classification can be enriched by the products submitted and/or by comments made by the visitors, thus introducing a bottom-up, developer and/or user oriented classification.</Paragraph> <Paragraph position="3"> A general goal of the most recent editions of the NLSR was the simplification of the registration procedure, providing a short form to be filled in by the customer. We no longer request an exhaustive description of the submitted product, but concentrate on a few points that provide guidance for the visitor, who will have to consult the home page of the institutions or authors that submitted the product to get more detailed information. In accordance with this simplification of the registration procedure, institutes or companies submitting their NLP products to the ACL Natural Language Software Registry are required to give their URL.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Extending the ACL Natural Language </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Software Registry </SectionTitle> <Paragraph position="0"> The ACL Registry was until recently a closed world, in the sense that the information encoded in it could be accessed only by browsing or querying within its web page. Obviously there is a need to access this information without having to start a web browser. It was therefore planned to provide an XML export, since XML is the standard for exchanging structured documents. 
This need became even more urgent after the Registry Team was asked for permission to harvest the ACL repository in order to create a prototype service provider in the context of an Open Archives Initiative for Language Resources, which is called OLAC (Open Language Archives Community) and described in (Bird and Simons, 2001).</Paragraph> <Paragraph position="1"> This excellent initiative also requires that the information provided by tools repositories be not only universally available but also conformant to certain standards for metadata description, in order to ensure interoperability across all the repositories participating as metadata providers in OLAC.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 XML for Tools Repositories </SectionTitle> <Paragraph position="0"> (Erjavec and Váradi, 2001) give a very interesting description of the TELRI-II concerted action for a tool catalogue specialized in corpus processing tools. This &quot;limitation&quot; in the coverage of the TELRI repository allows the authors to experiment extensively with various XML specifications and tools for building and displaying their catalogue.</Paragraph> <Paragraph position="1"> This experience should be beneficial for the more generic ACL Registry, as well as for other providers of tools repositories (for example national initiatives, like the one described in (Chaudiron et al., 2000)). The authors also mention one advantage of the limitation in the coverage of tools: the presence in the entries of a pointer to persons or institutions able to offer advice on installing and using the software. 
This also addresses one point mentioned in (Bird and Simons, 2001), where three main classes of providers are described: DATA, TOOLS and ADVICE providers.</Paragraph> <Paragraph position="2"> However, (Erjavec and Váradi, 2001) do not discuss how to integrate into the description of a tool its particular relation to a specific corpus. Nevertheless, this should be a common task to be tackled by all providers of tools repositories. The best strategy would probably be to start with specialized repositories, where the problems to be solved can appear earlier.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Metadata for NLP Tools </SectionTitle> <Paragraph position="0"> As we saw above, conformance to standards (XML) for document description and interchange is not by itself enough in the context of OLAC. But the use of metadata descriptions for tools seems to make sense not only for such initiatives. (Lavelli et al., 2001) show the use of metadata descriptions for tools in the context of an infrastructure for NLP application development. The role of metadata there is to specify the &quot;level of analysis accomplished by the source processor&quot;. Thus the metadata descriptions are useful for the communication between processes within an NLP chain, and also make it possible to mark and identify the document produced by such a process. In any case, the use of metadata descriptions for tools (or processes triggered by those tools) is probably a key issue in the modular design of complex NLP environments.</Paragraph> <Paragraph position="1"> One can also see in the SiSSA approach to metadata descriptions for NLP processes, perhaps as a side effect, a proposal for sharing annotations for processes and for the documents (resources) that can be handled. 
This might be a starting point for the systematic connection of the descriptions of both NLP tools and language resources.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Resources </SectionTitle> <Paragraph position="0"> Catalogues and repositories of natural language data resources have already been working on the topic of metadata description for their entries (see, for example, LDC and ELRA). One can see OLAC as a natural extension of the LDC, enlarging the resources catalogue to a real infrastructure for language resource identification.</Paragraph> <Paragraph position="1"> On the Language Engineering side there are initiatives for describing standards, and (Calzolari et al., 2001) present one such initiative, the ISLE project, which is the continuation of the EAGLES initiative. The main objective of ISLE is to promote &quot;widely agreed and urgently demanded standards and guidelines for infrastructural language resources ..., tools that exploit them and LE products&quot;. The ongoing discussions within this project are thus important for the intended extension of NLP tools repositories.</Paragraph> <Paragraph position="2"> While (Calzolari et al., 2001) concentrate on the description of the task of the ISLE computational lexicon working group and address the topic of metadata for encoding multilingual lexical resources, (Broeder and Wittenburg, 2001) present the work of the ISLE Metadata Initiative (IMDI), which is directly relevant to the topic addressed here. (Broeder and Wittenburg, 2001) give a good overview of metadata initiatives for Language Resources and offer a contrastive description of OLAC and IMDI, where the main distinction can be seen in the top-down versus bottom-up approach. 
The top-down approach followed by OLAC allows easy conformance to the Dublin Core set, whereas the bottom-up approach requires the definition of more &quot;narrow and specialized categorization schemes&quot;.</Paragraph> <Paragraph position="3"> This distinction is important for the intended extension of the metadata description for NLP tools, since the description of the tools will have to connect to those distinct kinds of categorization schemes for data resources. We think that the ACL Registry can easily be adapted to this situation, since the current classification of tools is a layered one: one layer is quite general (classifying tools with respect to broader application types, like &quot;Written Language&quot;), and the next layer puts more stress on the specific technology (for example Information Extraction versus Text Alignment).</Paragraph> <Paragraph position="4"> (Broeder and Wittenburg, 2001) also propose a scheme for connecting the descriptions of tools and resources. They suggest not including a listing of tools in the metadata description of the resources, since this set of tools would change over time. Rather, they suggest a detailed description of the type and the structure of the resources that can be accessed by a &quot;browser&quot; tool, which on the basis of the detailed metadata description can select potential tools for handling the resources. The tools repository would have to include this kind of information in its metadata description of the tools.</Paragraph> </Section> </Section> </Paper>