<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2003"> <Title>LinguaStream: An Integrated Environment for Computational Linguistics Experimentation</Title> <Section position="3" start_page="95" end_page="97" type="metho"> <SectionTitle> 2 The LinguaStream Platform </SectionTitle> <Paragraph position="0"> LinguaStreamisanintegratedexperimentationenvironment targeted to researchers in NLP. It allows complex experiments on corpora to be realised conveniently, using various declarative formalisms. Without appropriate tools, the development costs that are induced by each new experiment become a considerable obstacle to the experimental approach. In order to address this problem, LinguaStream facilitates the realisation of complex processes while calling for minimal technical skills.</Paragraph> <Paragraph position="1"> Its integrated environment allows processing streams to be assembled visually, picking individual components in a &quot;palette&quot; (the standard set contains about fifty components, and is easily extensibleusingaJavaAPI,amacro-componentsys- null tem, and templates). Some components are specifically targeted to NLP, while others solve various issues related to document engineering (especially to XML processing). Other components are to be used in order to perform computations on the annotations produced by the analysers, to visualise annotated documents, to generate charts, etc. Each component has a set of parameters that allow their behaviour to be adapted, and a set of input and/or output sockets, that are to be connected using pipes in order to obtain the desired processing stream (see figure 2). Annotations made on a single document are organised in independent layers and may overlap. Thus, concurrent and ambiguous annotations may be represented in order to be solved afterwards, by subsequent analysers.</Paragraph> <Paragraph position="2"> The platform is systematically based on XML recommendations and tools, and is able to process any file in this format while preserving its original structure. When running a processing stream, the platform takes care of the scheduling of sub-tasks, and various tools allow the results to be visualised conveniently.</Paragraph> <Paragraph position="3"> Fundamental principles First of all, the platform makes use of declarative representations, as often as possible, in order to define processing modules as well as their connections. Thus, available formalisms allow linguistic knowledge to be directly &quot;transcribed&quot; and used. Involved procedural mechanisms, committed to the platform, can be ignored. In this way, given rules are both descriptive (they provide a formal representation for a linguistic phenomenon) and operative (they can be considered as instructions to drive a computational process).</Paragraph> <Paragraph position="4"> Moreover, the platform takes advantage of the complementarity of analysis models, rather than considering one of them as &quot;omnipotent&quot;, that is to say, as able to express all constraint types.</Paragraph> <Paragraph position="5"> We indeed rely on the assumption that a complex analyser can successively adopt several points of view on the same linguistic data. Different formalisms and analysis models allow these different points of view. In a same processing stream, we can successively make use of regular expressions at the morphologic level, a local unification grammar at the phrasal level, finite state transducer at sentential level and constraint grammar for discourse level analysis. 
<Paragraph position="5"> The interoperability between analysis models and the communication between components are ensured by a unified representation of markups and annotations. The latter are uniformly represented by feature sets, which are commonly used in linguistics and NLP and allow rich, structured information to be represented. Every component can produce its own markup using preliminary markups and annotations. The available formalisms make it possible to express constraints on these annotations by means of unification. Thereby, the platform promotes progressive abstraction from surface forms. Insofar as each step can access the annotations produced upstream, high-level analysers often use only these annotations, ignoring the raw textual data.</Paragraph>
<Paragraph position="6"> Another fundamental aspect is the variability of the analysis grain between different analysis steps. Many analysis models require a minimal grain, called a token, to be defined. For example, formalisms such as grammars or transducers need a textual unit (such as the character or the word) to which patterns are applied. When a component requires such a minimal grain, the platform allows the unit types that are to be considered as tokens to be defined locally. Any previously marked unit can be used as such: the usual tokenisation into words, or any other previously analysed elements (syntagms, sentences, paragraphs...). The minimal unit may thus differ from one analysis step to another, which considerably widens the scope of the available analysis models. In addition, each analysis module indicates the antecedent markups to which it refers and which it considers relevant; other markups can be ignored, which makes it possible to partially rise above textual linearity. By combining these functionalities, a different point of view on the document can be defined for each analysis step.</Paragraph>
<Paragraph position="7"> The modularity of processing streams promotes the reusability of components in various contexts: a given module, developed for one processing stream, may be reused in other ones. In addition, every stream may be used as a single component, called a macro-component, in a higher-level stream. Moreover, for a given stream, each component may be replaced by any other functionally equivalent component. For a given subtask, a rudimentary prototype may thus ultimately be replaced by an equivalent, fully operational component. It is therefore possible to compare processing results in rigorously identical contexts, which is a necessary condition for relevant comparisons.</Paragraph>
<Section position="1" start_page="96" end_page="97" type="sub_section">
<SectionTitle> Analysis models </SectionTitle>
<Paragraph position="0"> We indicated above some of the components which may be used in a processing stream. Among those which are especially dedicated to NLP, two categories have to be distinguished. Some of them are ready-made analysers dedicated to a specific task; morpho-syntactic tagging (an interface with TreeTagger is provided by default) is such a task. Although some parameters allow the associated components to be adapted to the task (the tag set for a given language, etc.), it is impossible to fundamentally modify their behaviour. Others, on the contrary, provide an analysis model, that is to say, first of all, a formalism for representing linguistic constraints, by means of which the user can express the expected processing. This formalism usually relies on a specific operational model.</Paragraph>
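Before turning to the individual analysis models, the following minimal sketch illustrates the shared representation they operate on: annotations as feature sets, with constraints checked by unification. It is an illustrative simplification written against flat attribute-value maps, not LinguaStream code; the platform's actual feature sets allow richer, structured information, as described above.

```java
// Minimal sketch of unification over flat feature sets (attribute-value maps).
// Two feature sets unify when they agree on every shared attribute; the
// result merges their information. Conflicting values make unification fail.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class FeatureUnification {

    /** Returns the unified feature set, or empty if the two sets conflict. */
    static Optional<Map<String, String>> unify(Map<String, String> a,
                                               Map<String, String> b) {
        Map<String, String> result = new HashMap<>(a);
        for (Map.Entry<String, String> e : b.entrySet()) {
            String existing = result.get(e.getKey());
            if (existing != null && !existing.equals(e.getValue())) {
                return Optional.empty();          // conflicting values: unification fails
            }
            result.put(e.getKey(), e.getValue()); // compatible: merge the information
        }
        return Optional.of(result);
    }

    public static void main(String[] args) {
        Map<String, String> noun = Map.of("cat", "noun", "num", "sing");
        Map<String, String> singularFeminine = Map.of("num", "sing", "gender", "fem");
        Map<String, String> plural = Map.of("num", "plur");

        System.out.println(unify(noun, singularFeminine)); // merged feature set
        System.out.println(unify(noun, plural));           // Optional.empty: "num" clash
    }
}
```

Constraints expressed in the formalisms below can be read as unification requirements of this kind on previously produced annotations.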
<Paragraph position="1"> These analysis models allow constraints to be expressed on the surface form as well as on the annotations produced by the preceding analysers. All annotations are represented by feature sets, and the constraints are encoded by unification on these structures. Some of the available systems are described below.</Paragraph>
<Paragraph position="2"> * A system called EDCG (Extended-DCG) allows local unification grammars to be written, using the DCG (Definite Clause Grammars) syntax of Prolog. Such a grammar can be described in a purely declarative manner, although the features of the underlying logical language remain accessible to expert users.</Paragraph>
<Paragraph position="3"> * A system called MRE (Macro-Regular-Expressions) allows patterns to be described using finite-state transducers over surface forms and previously computed annotations. Its syntax is similar to the regular expressions commonly used in NLP. However, this formalism does not only consider characters and words: it may apply to any previously delimited textual unit.</Paragraph>
<Paragraph position="4"> * Another descriptive, prescriptive and declarative formalism, called CDML (Constraint-Based Discourse Modelling Language), supports a constraint-based approach to the formal description and computation of discourse structure. It considers both textual segments and discourse relations, and relies on the expression and satisfaction of a set of primitive constraints (presence, size, boundaries...) on previously computed annotations.</Paragraph>
<Paragraph position="5"> * Further components include a semantic lexicon marker, a configurable tokenizer (using regular expressions at the character level; a minimal sketch is given below), a system allowing linguistic units to be delimited relying on the XML tags available in the original document, etc.</Paragraph>
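As an illustration of the last item, here is a minimal sketch of a configurable, regular-expression-based tokenizer. The class, the Token record and the pattern are illustrative assumptions, not the platform's actual tokenizer component; in LinguaStream the resulting units would become annotations that later components can treat as their minimal grain.

```java
// Minimal sketch of a configurable tokenizer driven by a character-level
// regular expression. Illustrative only; not LinguaStream code.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConfigurableTokenizer {

    /** A token keeps its surface form and its character offsets. */
    record Token(String form, int start, int end) {}

    private final Pattern pattern;

    ConfigurableTokenizer(String regex) {          // the regular expression is the configuration
        this.pattern = Pattern.compile(regex);
    }

    List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {                         // every match becomes a token with offsets
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Letter sequences, digit sequences, or any other single non-space character.
        ConfigurableTokenizer tokenizer = new ConfigurableTokenizer("\\p{L}+|\\d+|\\S");
        tokenizer.tokenize("LinguaStream offers about fifty components.")
                 .forEach(t -> System.out.println(t.form() + " [" + t.start() + ", " + t.end() + ")"));
    }
}
```

</Section>
</Section>
</Paper>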