File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-1505_intro.xml
Size: 13,062 bytes
Last Modified: 2025-10-06 14:01:16
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1505"> <Title>SiSSA - An Infrastructure for NLP Application Development</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 SiSSA </SectionTitle> <Paragraph position="0"> representation and exchange of information (achieved using XML (Bray et al., 2000)).</Paragraph> <Paragraph position="1"> The difference between the third and the fourth elements above is that the CORBA part specifies the details of the communication process without any reference to the linguistic characteristics of the processors to be integrated (this part could be largely reused in other, non-linguistic projects involving a distributed architecture); the specifically linguistic details are embedded in the XML documents passed between the processors.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 SiSSA Architecture </SectionTitle> <Paragraph position="0"> The central element in the SiSSA architecture is an autonomous application, called the SiSSA Manager. It is autonomous in that it takes the initiative in managing the processing flow of the SiSSA system, where it mainly plays the role of a client. Its main tasks are the following: • to interact with the Processor Repository (the place where information about processors known to SiSSA is stored) in order to take a census of, activate, and connect the processors notified to the system; • to present the system's functionality to the user by means of a web-based graphical interface. To this end, the SiSSA Manager acts as a server with respect to the processors, towards which it mediates the &quot;centralized&quot; GUI. 
Through the latter, the SiSSA Manager not only interprets the user's actions but also gives her/him a report on the ongoing processing, storing and presenting logs and status messages coming from the active processors; • to manage and interpret the projects built by the user.</Paragraph> <Paragraph position="1"> The Processor Repository classifies the processors by associating each of them with the appropriate class of linguistic processors (e.g., morphological analyzers, PoS taggers, etc.). The Processor Repository also provides functionality for permanently storing the properties associated with the processors registered in it. Among them, the properties that specify the methods for activating a processor are crucial, because individual processors must be active in order to be available for use by SiSSA. The activation of a processor takes place by means of an Activation Server, reachable via CORBA at the URL stored in the Processor Repository, by specifying the corresponding activation string.</Paragraph> <Paragraph position="2"> The information is stored in the repository using RDF and RDFS (Resource Description Framework and RDF Schema; Guha, 2000).</Paragraph> <Paragraph position="3"> RDF Schema provides the means to check that the descriptions of the processors' characteristics comply with the SiSSA Manager's constraints. The RDF specification of the processors made available in SiSSA is usually built using a graphical interface. The adoption of RDF and RDF Schema enhances the generality of SiSSA by avoiding both ad hoc languages for resource description and ad hoc schemas for validating the documents that describe the processors.</Paragraph> <Paragraph position="4"> Turning to the processors stored in the repository, they mainly play the role of servers which are activated upon request by the SiSSA Manager. 
The goal of supporting distributed architectures for projects is pursued through the adoption of CORBA (Common Object Request Broker Architecture - http://www.corba.org, developed by the OMG industry consortium), which acts as the glue keeping together the executable parts of SiSSA.</Paragraph> <Paragraph position="5"> To be available to SiSSA, processors must be registered in the Processor Repository. To this end, they must exhibit interfaces that comply with a set of specifications defined using the CORBA Interface Definition Language (IDL). Thus, providing compliant interfaces is a necessary step towards integrating new processors within SiSSA.</Paragraph> <Paragraph position="6"> As to communication formats, the overall goals of SiSSA made the adoption of XML (Bray et al., 2000) a natural choice. Thus messages are exchanged in the form of XML documents of type process-data (see Section 2.3). These documents combine in a single structure the object to be processed and the information relevant for the processing itself (metadata). The generality of such a format permits its use both for the communication between the SiSSA Manager and the processors, and for the communication taking place directly among the processors.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Communication Protocols </SectionTitle> <Paragraph position="0"> As said, SiSSA provides a set of formal IDL specifications to which the interfaces of processors to be integrated in the environment must adhere. Such specifications model the interaction between the SiSSA Manager and the processors, as mediated by CORBA. They can be seen as a contract with which the SiSSA Manager and the processors have to comply for their mutual integration to be successful. 
A UML diagram of the SiSSA IDL specifications is shown in Figure 1.</Paragraph> <Paragraph position="1"> The scenario of the cooperation between the SiSSA Manager and a generic processor can be described as follows: • the processor's activation server starts and connects to the CORBA bus as a named server at a specified URL (i.e., the corbaloc: URL stored in the Processor</Paragraph> <Paragraph position="3"> the CORBA bus, can contact the activation server using the corbaloc: URL specified in the Processor Repository; using the processor's activation string it can ask the server to activate the corresponding processor; • from now on, the interaction takes place directly between the SiSSA Manager and the processors whose interface it obtained; • the SiSSA Manager can in this way act as a true manager, establishing and removing the connections between the processors according to the design of the processing flow decided by the user.</Paragraph> <Paragraph position="4"> In SiSSA, the communication is asynchronous, and is implemented by means of a flow of XML documents. The processors and the SiSSA Manager can be both the source and the target of communication. Moreover, each communication can have more than one target.</Paragraph> <Paragraph position="5"> Being a possible target of communication, each registrable processor provides the functionality of the interface IObserver. The SiSSA Manager establishes or removes the relationships between processors, according to the user's requirements, by inserting observers into, or deleting them from, a processor's list of observers. Besides the communication related to the linguistic processing, other relevant communication flows concern error messages and tracing information. Logs and messages directed to the user are managed through the interface IMsgMonitor. 
Finally, the interface IStateMonitor (provided by the SiSSA Manager) allows each processor to signal to its callers the status of its own processing (an example of its use is shown in the bottom bar of the window shown in Figure 3).</Paragraph> <Paragraph position="6"> An important service provided by the SiSSA Manager is the XSL processing of XML documents. To this end, the SiSSA Manager provides the interface XSLProcServer, through which XSLProcessor (a processor specialized in XSL transformations) is made available. This feature allows XSL transformations to be inserted between any pair of processors, making it possible to adapt one processor's output to the requirements of the following one(s). This feature is of the utmost importance for extending SiSSA's ability to integrate and make available a wide range of processors.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Communication and Representation Formats </SectionTitle> <Paragraph position="0"> Communications take place using a &quot;data container&quot; modeled by the interface IDataStream. An object that implements this interface is sent by a processor to each of its observers on completion of its processing.</Paragraph> <Paragraph position="1"> IDataStream is designed as a container rather than as a structured model of the data exchanged. The definition of structured models for data is completely independent of IDataStream, and is obtained through different means. Indeed, given that the contents of data streams are XML documents, their structure is made explicit by means of Document Type Definitions (DTDs).</Paragraph> <Paragraph position="2"> SiSSA is a development environment meant to be open to the integration of new components, which can differ from one another along a number of dimensions, including the input/output formats. 
At the same time, SiSSA should allow the user an adequate level of control over the intermediate results produced during the computation (i.e., the output of each processor). XML allows a representation of data which is transparent and accessible to the developer, without the need for her/him to know the details of the implementation of the individual components. At the same time, it does not increase the complexity of the CORBA interfaces that encapsulate such data.</Paragraph> <Paragraph position="3"> The data defined in XML are associated with a document of type process-data. Each document of type process-data necessarily includes two parts: • linguistic data, usually corresponding to the result of the computation done by the source processor; • metadata. Their role is to specify: the level of analysis accomplished by the source processor (e.g., tokenisation, parsing, etc.); the unique identifier of the processor originating the data; further useful information about processing (time of execution, rules applied, etc.). Moreover, metadata make available a unique identifier for the process-data document. This is useful for associating the input with the different output structures produced by the different processing steps.</Paragraph> <Paragraph position="4"> The linguistic data have to comply with the definitions specified for the different classes of processors. Such classes are identified by the attribute level-of-analysis present in the metadata (e.g., morphological analyzer, PoS tagger, chunk parser, etc.) and should take into account (at least to a certain extent) idiosyncrasies of specific processors. 
For instance, a morphological analyzer can adopt a set of category labels not entirely coincident with that of another morphological analyzer.</Paragraph> <Paragraph position="5"> Obviously, a structure that aims to carry linguistic data of different nature, and so differently represented, can become quite complex when the number of levels of analysis taken into consideration increases. Moreover, during the development phase, the problem arises of integrating data structures relative to levels of analysis not previously taken into consideration, as well as data structures idiosyncratic to processors belonging to some classes. The modular nature of the DTDs for XML allows a neat distinction between metadata and data relative to classes of processors (idiosyncratic data). The former are described in a single DTD, defined as part of the resources internal to SiSSA, while the latter can be conveyed by various DTDs, possibly made available in SiSSA along with each processor.</Paragraph> <Paragraph position="6"> As said, each processor at the end of its processing makes available a document of type process-data, which contains exclusively the output data of the specific processor and, of course, the corresponding metadata. Such a document is a representation of the output of the processor that generated it, and does not contain any representation relative to previous levels of analysis, the input text or the history of the processing done so far. Thus, for efficiency reasons, process-data documents are not an incremental collection of all the data produced by the various processors. At the same time, the need to keep a link between the input text and the output produced by the system cannot be ignored. It is also reasonable that in certain situations (e.g., during testing and debugging) the structures produced by the intermediate processors, as well as the metadata of the various processors, are needed to show or save tracing information. 
In the proposed architecture, this task is accomplished by the SiSSA Manager, which can register itself as an observer of any processor; in this way it can access the processor output and show it to the user or build a tracing struc-</Paragraph> </Section> </Section> </Paper>
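The observer mechanism described in Sections 2.2 and 2.3 can be illustrated with a minimal sketch. It assumes plain in-process Python objects in place of CORBA objects and synchronous calls in place of SiSSA's asynchronous communication. Only the interface names IObserver and IDataStream come from the text; every class, method, and attribute name below (Observer, DataStream, Processor, TraceSink, add_observer, remove_observer, notify, payload) is hypothetical, and plain strings stand in for process-data XML documents.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataStream:
    """Hypothetical stand-in for IDataStream: an opaque container."""
    payload: str  # would be a process-data XML document in SiSSA


class Observer:
    """Hypothetical stand-in for IObserver: a possible communication target."""

    def notify(self, stream: DataStream) -> None:
        raise NotImplementedError


class Processor(Observer):
    """A processor is both a target (Observer) and a source of data streams."""

    def __init__(self, name: str, transform: Callable[[str], str]) -> None:
        self.name = name
        self.transform = transform
        self.observers: List[Observer] = []

    def add_observer(self, observer: Observer) -> None:
        # Establishing a connection = inserting into the observer list.
        if observer not in self.observers:
            self.observers.append(observer)

    def remove_observer(self, observer: Observer) -> None:
        # Removing a connection = deleting from the observer list.
        if observer in self.observers:
            self.observers.remove(observer)

    def notify(self, stream: DataStream) -> None:
        # The result carries only this processor's output, not the input.
        result = DataStream(self.transform(stream.payload))
        # One communication may have more than one target.
        for observer in list(self.observers):
            observer.notify(result)


class TraceSink(Observer):
    """Plays the tracing role: records every stream it observes."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def notify(self, stream: DataStream) -> None:
        self.seen.append(stream.payload)


def run_demo() -> List[str]:
    tokenizer = Processor("tokenizer", lambda s: s.replace(" ", "|"))
    tagger = Processor("tagger", lambda s: s.upper())
    trace = TraceSink()

    # Wire the flow: tokenizer feeds tagger, and the trace observes both.
    tokenizer.add_observer(tagger)
    tokenizer.add_observer(trace)
    tagger.add_observer(trace)
    tokenizer.notify(DataStream("a b"))

    # Rewire: disconnect the tagger, then process a second input.
    tokenizer.remove_observer(tagger)
    tokenizer.notify(DataStream("c d"))
    return trace.seen
```

In this sketch the wiring code in `run_demo` plays the part of the SiSSA Manager: it changes the processing flow only by editing observer lists, and it sees intermediate results by registering a tracing observer on each processor.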