<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1036">
  <Title>An Open Distributed Architecture for Reuse and Integration of Heterogeneous NLP Components</Title>
  <Section position="4" start_page="0" end_page="246" type="metho">
    <SectionTitle>
2 Reuse in NLP
</SectionTitle>
    <Paragraph position="0"> There is an increasing amount of shared corpora and lexical resources that are being made available for NLP researchers through managed data repositories such as LDC, CLR, ELRA, etc. (see e.g., (Wilks et al. 92) for an overview of these repositories).</Paragraph>
    <Paragraph position="1"> These resources constitute the basic raw materials for building NLP software but not all of these resources can be readily used: they might be available in formats that require extensive pre-processing to transform them into resources that are tractable by NLP software. This pre-processing cannot usually be fully automated and is therefore costly.</Paragraph>
    <Paragraph position="2"> Some projects have concentrated on developing lexical resources directly in a format suitable for further use in NLP software (e.g., Genelex, Multilex). These projects go beyond the definition of interchange formats to define a &amp;quot;neutral&amp;quot; linguistic representation in which all lexical knowledge is encoded and from which, by means of specialized compilers, application-specific dictionaries can be extracted. The lexical knowledge encoded in these systems can truly be called reusable since neither  the format nor the content is application-dependent.</Paragraph>
    <Paragraph position="3"> The result of these projects is however not available to the research community.</Paragraph>
    <Paragraph position="4"> Reuse of NLP software components remains much more limited (Cunningham et al. 96) since problems are compounded: the software components of an NLP system need not only to be able to exchange data using the same format (e.g., feature structures) and to share the same interpretation of the information they exchange (same linguistic theory, e.g., LFG), but they also need to communicate at the process level, either through direct API calls if they are written in the same programming language or through other means if, for example, they have to run on different platforms--a classical software integration problem. Thus, reuse of NLP software components can be defined as an integration problem.</Paragraph>
    <Paragraph position="5"> It is not of course the only approach to reuse in NLP (see for example (Biggerstaff &amp; Perlis 89) for an overview of alternative approaches to software reuse) and some previous efforts have, for example, been directed at building Integrated Development Environments ((Boitet et al. 82; Simkins 94; Alshawi 92; Grover et al. 93) to mention but a few). Although Integrated Development Environments address some of the problems, they do not give a complete solution since one still has to develop rules and lexical entries using these systems.</Paragraph>
    <Paragraph position="6"> Direct reuse of NLP software components, e.g., using an existing morphological analyzer as a component of a larger system, is still very limited but is nevertheless increasingly attractive since the development of large-scale NLP applications, a focus of current NLP research, is prohibitive for many research groups. The Tipster architecture for example is directed towards the development of information retrieval and extraction systems (ARPA 94; Grishman 95) and provides a modular approach to component integration. The GATES architecture builds upon the Tipster architecture and provides a graphical development environment to test integrated applications (Cunningham et al. 96). Speech machine-translation architectures need also to solve difficult integration problems and original solutions have been developed in the Verbmobil project (GSrz et al. 96), and by researchers at ATR (e.g., (Boitet &amp; Seligman 94)) for example. A generic NLP architecture needs to address component communication and integration at three distinct levels:  1. The process or communication layer involves,  for example, communication between different components that could be written in different programming languages and could be running as different processes on a distributed network.</Paragraph>
    <Paragraph position="7"> 2. The data layer involves exchange and translation of data structures between components.</Paragraph>
    <Paragraph position="8"> . At the linguistic level, components need to share the same interpretation of the data they exchange.</Paragraph>
    <Paragraph position="9"> A particular NLP architecture embodies design choices related to how components can talk to each other. A variety of solutions are possible as illustrated below.</Paragraph>
    <Paragraph position="10"> * Each component can talk directly to each other and thus all components need to incorporate some knowledge about each other at all three levels mentioned above. This is the solution adopted in the Verbmobil architecture which makes use of a special communication software package (written in C and imposing the use of C and Unix) at the process level and uses a chart annotated with feature structures at the data-structure level. At the linguistic level, a variant of HPSG is used (Kesseler 94; Amtrup 95; Turk &amp; Geibler 95; GSrz et al. 96).</Paragraph>
    <Paragraph position="11"> * A central coordinator can incorporate knowledge about each component but the component themselves don't have any knowledge about each other, or even about the coordinator. Filters are needed to transform data back and forth between the central data-structure managed by the coordinator (a lattice would be appropriate) and the data processed by each component. Communication between the coordinator and the components can be asynchronous and the coordinator needs then to serialize the actions of each component. This solution, a variant of the blackboard architecture (Erman &amp; Lesser 80) is used in the Kasuga speech translation prototype described in (Boitet &amp; Seligman 94). This architecture imposes no constraints on the components (programming language or software architecture) since communication is based on the SMTP protocol.</Paragraph>
    <Paragraph position="12"> * The Tipster Document Architecture makes no assumption about the solution used either at the process level or at the linguistic level. At the data structure level, NLP components exchange data by reading and writing &amp;quot;annotations&amp;quot; associated with some segment of a document (Grishman 95). This solution also forms the basis of the GATES system (Cunningham et al. 96). Various versions of this architecture  have been developed (in C, C++ and Lisp) but no support is defined for integration of heterogeneous components. However, in the Tipster Phase III program, a CORBA version of the Tipster architecture will be developed to support distributed processing.</Paragraph>
  </Section>
  <Section position="5" start_page="246" end_page="249" type="metho">
    <SectionTitle>
3 The Corelli Document Processing Architecture
</SectionTitle>
    <Paragraph position="0"> Architecture The Corelli Document Processing Architecture is an attempt to address the various problems mentioned above and also some other software-level engineering issues such as robustness, portability, scalability and inter-language communication (for integrating components written in Lisp, C or other languages).</Paragraph>
    <Paragraph position="1"> Also of interest are some ergonomic issues such as tractability, understandability and ease of use of the architecture (the programmer being the user in this case). The architecture provides support for component communication and for data exchange. No constraint is placed on the type of linguistic processing but a small library of data-structures for NLP is provided to ease data-conversion problems.</Paragraph>
    <Paragraph position="2"> The data layer implements the Tipster Document Architecture and enables the integration of Tipster-compliant components. This architecture is geared to support the development of large-scale NLP applications such as Information Retrieval systems, multilingual MT systems (Vanni &amp; Zajac 96), hybrid or multi-engine MT systems (Wilks et al. 92; Frederking et al. 94; Sumita &amp; Iida 95), speech-based systems (Boitet &amp; Seligman 94; G5rz et al. 96) and also systems for the exploration and exploitation of large corpora (Ballim 95; Thompson 95).</Paragraph>
    <Paragraph position="3"> Basic software engineering requirements * A modular and scalable architecture enables the development of small and simple applications using a file-based implementation such as a grammar checker, as well as large and resource-intensive applications (information retrieval, machine translation) using a database back-end (with two levels of functionality allowing for a single-user persistent store and a full-size commercial database).</Paragraph>
    <Paragraph position="4"> * A portable implementation allows the development of small stand-alone PC applications as well as large distributed Unix applications.</Paragraph>
    <Paragraph position="5"> Portability is ensured through the use of the Java programming language.</Paragraph>
    <Paragraph position="6"> * A simple and small API which can be easily learned and does not make any presupposition about the type of application. The AP! is defined using the IDL language and structured according to CORBA standards and the CORBA services architecture (OMG 95).</Paragraph>
    <Paragraph position="7"> A dynamic Plug'n Play architecture enabling easier integration of components written in different programming languages (C, C++, Lisp, Java, etc), where components are &amp;quot;wrapped&amp;quot; as tools supporting a common interface.</Paragraph>
    <Section position="1" start_page="246" end_page="247" type="sub_section">
      <SectionTitle>
3.1 Data Layer: Document Services
</SectionTitle>
      <Paragraph position="0"> The data layer of the Corelli Architecture is derived from the Tipster Architecture and implements the requirements listed above. In this architecture, components do not talk directly to each other but communicate through information (so-called 'annotations') attached to a document. This model reduces inter-dependencies between components, promoting the design of modular applications (Figure  1) and enabling the development of blackboard-type applications such as the one described in (Boitet &amp; Seligman 94). The architecture provides solutions for * Representing information about a document, * Storing and retrieving this information in an efficient way, * Exchanging this information among all compo- null nents of an application.</Paragraph>
      <Paragraph position="1"> It does not however provide a solution for translating linguistic structures (e.g., mapping a dependency tree to a constituent structure). These problems are application-dependent and need to be resolved on a case-by-case basis; such integration is feasible, as demonstrated by the various Tipster demonstration systems, and use of the architecture reduces significantly the load of integrating a component into the application.</Paragraph>
      <Paragraph position="2"> Documents, Annotations and Attributes The data layer of the Corelli Document Processing Architecture follows the Tipster Architecture. The basic data object is the document. Documents can have attributes and annotations, and can be grouped into collections. Annotations are used to store information about a particular segment of the document (identified by a span, i.e., start-end byte offsets in the document content) while the document itself remains unchanged. This contrasts with the SGML solution used in the Multext project where information about a piece of text is stored as additional SGML mark-up in the document itself (Ballim 95;  Thompson 95). This architecture supports read-only data (e.g., data stored in a CD-ROM) as well as writable data. Annotations are attributed objects that contain application objects. They can be used, for example, to store morphological tags produced by some tagger, to represent the HTML structure of an HTML document or to store partial results of a  data-structure enable modular architectures and reduce the number of interfaces from the order of n 2 to the order of n.</Paragraph>
    </Section>
    <Section position="2" start_page="247" end_page="247" type="sub_section">
      <SectionTitle>
Document Annotations
</SectionTitle>
      <Paragraph position="0"> Corelli document annotations axe essentially the same as Tipster document annotations and a similar generic interface is provided. However, considering the requirements of NLP applications such as parsers or documents browsers, two additional interfaces are provided: * Since a set of annotations can be quite naturally interpreted as a chart, a chart interface provides efficient access to annotations viewed as a directed graph following the classical model of the chart first presented in (Kay 73).</Paragraph>
      <Paragraph position="1"> * An interv~-tree interface provides efficient access for efficient implementation of display functionalities. null  An application manipulating only basic data types (strings, numbers,...) need not define application objects. However, some applications may want to store complex data structures as document annotations, for example, trees, graphs, feature structures, etc. The architecture provides a top application-object class that can be sub-classed to define specific application objects. To support persistency in the file: based version, an application object needs to implement the read-persistent and write-persistent interfaces (this is provided transparently by the persistent versions). A small library of application objects is provided with the architecture.</Paragraph>
    </Section>
    <Section position="3" start_page="247" end_page="248" type="sub_section">
      <SectionTitle>
Accessing Documents
</SectionTitle>
      <Paragraph position="0"> Documents are accessible via a Document Server which maintains persistent collections, documents and their attributes and annotations. An application can define its own classes for documents and collections. In the basic document class provided in the architecture, a document is identified by its name (URL to the location of the document's content). In this distributed data model, accessing a document via a Document Server gives access to a document's contents and to attributes and annotations of a document.</Paragraph>
      <Paragraph position="1">  such as CORBA for defining inter-operable interfaces, and HTTP for data transport. Following the CORBA model, the Architecture is structured as a  set of services with well- defined interfaces: * A Document Management Service (DMS) provides functions for manipulating collections, documents, annotations and attributes.</Paragraph>
      <Paragraph position="2"> * A Life-Cycle Service provides creation, copying, moving and deletion of objects.</Paragraph>
      <Paragraph position="3"> * A Naming Service provides access to documents and collections via their names. Named collections and documents are persistent.</Paragraph>
      <Paragraph position="4"> Figure 2 gives an overview of the Corelli Document Architecture: an NLP component accesses a Document Service provided by a Document Server using the Corelli Document Architecture API. Client-side application component API calls on remote object references (requested from the Orb).  are transparently 'transferred' by the Orb to a Document Services implementation object for invocation. Figure 3 describes the Java IDL compiler and Java Door Orb interaction. The Corelli Document Architecture API is specified using the Interface Definition Language (IDL), a standard defined by the Object Management Group (OMG 95). The IDL-to-Java compiler essentially produces three significant files: one containing a Java interface corresponding to the IDL operational interface itself, a second containing client-side 'stub' methods to invoke on remote object references (along with code to handle Orb communication overhead), and a third containing server-side 'skeleton' methods to handle implementation object references. What remains is for the server code, implementing the IDL operational interface to be developed. null When the server implementing the IDL specification is launched, it creates skeleton object references for implemented services/objects and publishes them on the Orb. A client wishing to invoke methods on those remote objects creates stub object references and accesses the orb to resolve them with the implementation references on the server side. Any client API call made on a resolved object reference is then transparently (to the client) invoked on the corresponding server-side object.</Paragraph>
      <Paragraph position="5"> The Document Management Service, the Life-Cycle Service and the Naming Service are included in the three versions of the architecture which implement increasingly sophisticated support of database functionalities: . The basic file-based version of the architecture uses the local file system to store persistent data (collections, attributes and annotations); the contents of a document can however be located anywhere on the Internet.</Paragraph>
      <Paragraph position="6"> . A persistent store version uses a persistent-store back-end for storing and retrieving collections, attributes and annotations: this version supports the Persistent Object Service which provides greater efficiency for storing and accessing persistent objects as well as enhanced support for defining persistent application objects.</Paragraph>
      <Paragraph position="7"> . A database version uses a commercial database management system to store and retrieve collections, attributes and annotations and also documents (through an import/export mechanism).</Paragraph>
      <Paragraph position="8"> This version provides a Concurrency Control Service and a Transaction Service.</Paragraph>
      <Paragraph position="9"> Communication Layer To support integration and communication at the process level, the current version of the Corelli Architecture provides component inter-communication via the Corelli Plug'n Play architecture (see below) and the Java Door Orb.</Paragraph>
    </Section>
    <Section position="4" start_page="248" end_page="249" type="sub_section">
      <SectionTitle>
3.2 Plug'n Play Architecture
</SectionTitle>
      <Paragraph position="0"> The data layer of the Corelli Document Architecture, as described above, provides a static model for component integration through a common data framework. This data model does not provide any support for communication between components, i.e., for executing and controlling the interaction of a set of components, nor for rapid tool integration.</Paragraph>
      <Paragraph position="1"> The Corelli Plug'n Play layer aims at filling this gap by providing a dynamic model for component integration: this framework provides a high-level of plug-and-play, allowing for component interchangeability without modification of the application code, thus facilitating the evolution and upgrade of individual components.</Paragraph>
      <Paragraph position="2"> In the preliminary version of the Corelli Plug'n Play layer, the choice was made to develop the most general version of the architecture to ensure that any tool can be integrated using this framework. In this model, all components run as servers and the application code which implements the logic of the application runs as a client of the component servers. To be integrated, a component needs to support synchronous or asynchronous versions of one or several of four basic operations: execute, query, convert and exchange (in addition to standard initialization ad termination operations). Client-server communication is supported by the Java Door Orb.</Paragraph>
      <Paragraph position="3"> The rationale for this architecture is that many NLP tools are themselves rather large software corn- null ponents, and embedding them in servers helps to reduce the computation load. For example, some morphological analyzers load their dictionary in the process memory, and on small documents, simply starting the process could take more time than actual execution. In such cases, it is more efficient to run the morphological analyzer as a server that can be accessed by various client processes. This architecture also allows the processing load of an application to be distributed by running the components on several machines accessible over the Internet, thereby enabling the integration of components running on widely different architectures. This model also provides adequate support for the integration of static knowledge sources (such as dictionaries) and of ancillary tools (such as codeset converters).</Paragraph>
      <Paragraph position="4"> Figure 4 gives a picture of one possible integration solution. In this example, each component of the application is embedded in a server which is accessed through the Corelli Component Integration API as described above. A component server translates an incoming request into a component action.</Paragraph>
      <Paragraph position="5"> The server also acts as a filter by translating the document data structures stored in the Document Server in a format appropriate as input for the component and conversely for the component output.</Paragraph>
      <Paragraph position="6"> Each component server acts as a wrapper and several solutions are possible: . If the component has a Java API, it can be encapsulated directly in the server.</Paragraph>
      <Paragraph position="7"> * If the component has an API written in one of the languages supported by the Java Native Interface (currently C and C++), it can be dynamically loaded into the server at runtime and accessed via a Java front end.</Paragraph>
      <Paragraph position="8">  * If the component is an executable, the server must issue a system call for running the program and data communication usually occurs through files.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="249" end_page="250" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="249" end_page="249" type="sub_section">
      <SectionTitle>
4.1 Document Server Implementation
</SectionTitle>
      <Paragraph position="0"> The Document Server consists of three major modules: Document Management Service, Naming Service, and Life-Cycle Service. The modules are defined in IDL, and implemented in Java. The Sun Java IDL system, with its Door Orb implementation, is used to interface client programs to the Document Server implementation.</Paragraph>
      <Paragraph position="1"> The Document Management Service module provides methods to access and manipulate the components of objects (e.g., attributes, annotations and content of a document object).</Paragraph>
      <Paragraph position="2"> The Life-Cycle Service is responsible for creating and copying objects.</Paragraph>
      <Paragraph position="3"> The Naming Service binds a name to an object.</Paragraph>
      <Paragraph position="4"> The Naming Service supports a limited form of persistency for storing bindings.</Paragraph>
      <Paragraph position="5"> For example, to create a new document, the client program creates it through the Life-Cycle Service, bind a name to it using the Naming Service, and add attributes and annotations to it through the Document Management Service.</Paragraph>
      <Paragraph position="6"> The Document Server itself is accessed via its API and is running as a Java Door Orb supporting requests from the component's servers.</Paragraph>
      <Paragraph position="7"> This framework does not provide a model for controlling the interaction between the components of an application : the designer of an NLP application can use a simple sequential model or more sophisticated blackboard models : since this distributed model supports both the synchronous and the asynchronous types of communication between components, it supports a large variety of control models.</Paragraph>
    </Section>
    <Section position="2" start_page="249" end_page="250" type="sub_section">
      <SectionTitle>
4.2 Porting of the Temple Machine
Translation System
</SectionTitle>
      <Paragraph position="0"> To bootstrap the CoreUi Machine Translation System and test the implementation of the architecture, we are currently porting the CRL's Temple machine-translation system prototype (Vanni &amp; Zajac 96) to the Corelli architecture. This task will be aided by two features: first, the Temple system already utilizes the Tipster Document Architecture for data exchange between components, and second, the Temple system has a pipelined architecture which will  allow modular encapsulation of translation stages (e.g., dictionary lookup) as Corelli Plug'n Play tools. The Temple morphological analyzers and the English morphological generator all function as stand-alone executables and will be easily converted to Corelli Plug'n Play tools. Lexical resources (e.g., dictionaries and glossaries), on the other hand, are currently maintained in a database and are accessed via calls to a C library API. Each lexical resource is wrapped as a Plug'n Play tool implementing the query interface: in order to interface with the databases, the Java Native Interface is used to wrap the C database library. Finally, we will have to reengineer a portion of the top-level application control code (in C) in Java.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>