<?xml version="1.0" standalone="yes"?> <Paper uid="P06-4019"> <Title>Outilex, a Linguistic Platform for Text Processing</Title> <Section position="4" start_page="0" end_page="73" type="intro"> <SectionTitle> 2 Introduction </SectionTitle> <Paragraph position="0"> The Outilex Project (Blanc et al., 2006) aims to develop an open linguistic platform, including tools, electronic dictionaries and grammars, dedicated to text processing. It is the result of a collaboration among ten French partners: four universities and six industrial organizations. The project started in 2002 and will end in 2006. The platform, which will be made freely available to research, development and industry in April 2007, comprises software components implementing all the fundamental operations of written text processing: text segmentation, morphosyntactic tagging, parsing with grammars, and language resource management.</Paragraph> <Paragraph position="1"> All Language Resources are structured in XML formats, as well as in binary formats better suited to efficient processing; the required format converters are included in the platform. The grammar formalism allows statistical approaches to be combined with resource-based approaches.</Paragraph> <Paragraph position="2"> Manually constructed lexicons of substantial coverage for French and English, originating from the former LADL, will be distributed with the platform under the LGPL-LR license.</Paragraph> <Paragraph position="3"> The platform aims to be a general-purpose base for diverse processing tasks on text corpora. Furthermore, it uses portable formats and format converters that allow several software components to be combined. Many platforms are dedicated to NLP, but none is fully satisfactory, for various reasons. Intex (Silberztein, 1993), FSM (Mohri et al., 1998) and Xelda are closed source. 
Unitex (Paumier, 2003), inspired by Intex, has its source code under the LGPL license, but it does not support standard formats for Language Resources (LR).</Paragraph> <Paragraph position="4"> Systems like NLTK (Loper and Bird, 2002) and Gate (Cunningham, 2002) do not offer functionality for lexical resource management.</Paragraph> <Paragraph position="5"> All the operations described below are implemented as independent C++ modules that interact with each other through XML streams.</Paragraph> <Paragraph position="6"> Each function is accessible to programmers through a specified API and to end users through binary programs. Programs can be invoked from a graphical user interface implemented in Java.</Paragraph> <Paragraph position="7"> This interface allows users to define their own processing flows and to work on several projects with specific texts, dictionaries and grammars.</Paragraph> </Section> </Paper>