<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2210">
<Title>A Generic Collaborative Platform for Multilingual Lexical Database Development</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 The Papillon project </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Motivations </SectionTitle>
<Paragraph position="0"> Initially launched in 2000 by a French-Japanese consortium, the Papillon project (Serasset and Mangeot-Lerebours, 2001) rapidly extended its original goal, the development of a rich French-Japanese lexical database, to its current goal: the development of an acception-based multilingual lexical database (currently covering Chinese, English, French, German, Japanese, Lao, Malay, Thai and Vietnamese).</Paragraph>
<Paragraph position="1"> This evolution was motivated by three aims:
* to reuse many existing lexical resources, even those that do not directly involve both initial languages;
* to make the database reusable by many people on the Internet, thereby raising the interest of others in its development;
* to allow external people (translators, native speakers, teachers...) to contribute to its development.
For this project, we chose to adopt as much as possible the development paradigm of LINUX and GNU software [2], as we believe that the lack of high-level, rich and freely accessible multilingual lexical data is one of the most crucial obstacles to the development of a truly multilingual information society [3].</Paragraph>
<Paragraph position="2"> [2] i.e. allowing and encouraging external users to access and contribute to the database.</Paragraph>
<Paragraph position="3"> [3] i.e. 
an Information Society with no linguistic domination, where everybody will be able to access any content in their own mother tongue.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Papillon acception-based multilingual database </SectionTitle>
<Paragraph position="0"> The Papillon multilingual database has been designed independently of its usage(s). It consists of several monolingual volumes linked by way of a single interlingual volume called the interlingual acception dictionary.</Paragraph>
<Paragraph position="1"> [Figure: ... MLDB, showing the handling of contrastive problems.]</Paragraph>
<Paragraph position="2"> Each monolingual volume consists of a set of word senses (lexies), each lexie being described using a structure derived from the Explanatory and Combinatory Dictionary (Mel'cuk et al., 1995; Mel'cuk et al., 1984, 1989, 1995, 1996). The interlingual acception dictionary consists of a set of interlingual acceptions (axies) as defined in (Serasset, 1994). An interlingual acception serves as a placeholder bearing links to lexies and links between axies. This simple mechanism allows for the coding of translations. As an example, figure 1 shows how we can represent a quadrilingual database with contrastive problems (using the well-known &quot;rice&quot; example).</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Development methodology </SectionTitle>
<Paragraph position="0"> The development of the Papillon multilingual dictionary brings together voluntary contributors and trusted language specialists involved in different tasks (as shown in figure 2).</Paragraph>
<Paragraph position="1"> * First, an automatic process creates a draft acception-based multilingual lexical database from existing monolingual and bilingual lexical resources, as described in (Teeraparseree, 2003; Mangeot-Lerebours et al., 2003). 
This step is called the bootstrapping process.</Paragraph>
<Paragraph position="2"> [Figure: ... the Papillon database.]</Paragraph>
<Paragraph position="3"> * Then, contributions may be made by volunteers or trusted language specialists. A contribution is the creation, modification or deletion of an entry. Each contribution is stored and immediately available to others.</Paragraph>
<Paragraph position="4"> * Volunteers or language specialists may validate these contributions by ranking them.
* Finally, trusted language specialists integrate the contributions and apply them to the master MLDB. Rejected contributions are then no longer available.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.4 The Papillon Platform </SectionTitle>
<Paragraph position="0"> The Papillon platform is a community web site specifically developed for this project. The platform is written entirely in Java using the &quot;Enhydra&quot; web development framework. All XML data is stored in a standard relational database (Postgres). 
This community web site offers several services:
* a unified interface to simultaneously access the Papillon MLDB and several other monolingual and bilingual dictionaries;
* a specific editing interface to contribute to the Papillon MLDB;
* an open document repository where registered users may share writings related to the project; among these documents, one [...]
Sections 3 and 4 present the first and second services.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Unified access to existing dictionaries </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Presentation </SectionTitle>
<Paragraph position="0"> [Figure: ... results from three different dictionaries.] To encourage volunteers, we think that it is important to offer a real service that attracts as many Internet users as possible. We therefore began our development with a service that allows users to access many dictionaries in a unified way. This service currently gives access to twelve (12) bilingual and monolingual dictionaries, totalling a little less than 1 million entries, as detailed in table 1.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Strong points </SectionTitle>
<Paragraph position="0"> The unified access interface allows the user to access several dictionaries with different structures simultaneously. Each available dictionary is queried according to its own structure. 
Moreover, each result is displayed in a form that fits the structure of its source dictionary.</Paragraph>
<Paragraph position="1"> Any monolingual, bilingual or multilingual dictionary may be added to this collection, provided that it is available in XML format.</Paragraph>
<Paragraph position="2"> With the Papillon platform, giving access to a new, unknown dictionary is a matter of writing two XML files: a dictionary description and an XSL stylesheet. For currently available dictionaries, this took an average of about one hour per dictionary.
Notes to table 1:
a. Japanese-French dictionary of armament from the French Embassy in Japan
b. Chinese-English from Mandel Shi (Xiamen univ.)
c. (Richter, 1999)
d. (Paik and Bond, 2003)
e. (Gut et al., 1996)
f. University Stendhal, Grenoble III
g. (Breen, 2004a)
h. (Breen, 2004b)
i. Thai Dictionary of Kasetsart University
j. (Duc, 1998)
k. (Apel, 2004)</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Implementation </SectionTitle>
<Paragraph position="0"> It is possible to give access to any XML dictionary, regardless of its structure. For this, one has to identify a minimum set of information in the dictionary's XML structure.</Paragraph>
<Paragraph position="1"> The Papillon platform defines a standard structure for an abstract dictionary containing the subset of information most frequently found in dictionaries. This abstract structure is called the Common Dictionary Markup (Mangeot-Lerebours and Serasset, 2002). 
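As a rough illustration of what such an abstract layer amounts to, here is a minimal Python sketch, assuming that each abstract element is tied to a location in the source dictionary by a path expression. The element names (`headword`, `pos`, `translation`), the paths and the sample entry are all hypothetical; they are not taken from the Common Dictionary Markup, and the platform itself is written in Java.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: abstract element name -> path in the source
# dictionary. Names and paths are invented for illustration only.
DESCRIPTION = {
    "headword": "./form/orth",
    "pos": "./gram/pos",
    "translation": "./trans/tr",
}


def build_sample_entry():
    """Build a made-up entry in the source dictionary's own structure."""
    entry = ET.Element("entry")
    ET.SubElement(ET.SubElement(entry, "form"), "orth").text = "riz"
    ET.SubElement(ET.SubElement(entry, "gram"), "pos").text = "n.m."
    ET.SubElement(ET.SubElement(entry, "trans"), "tr").text = "rice"
    return entry


def to_abstract(entry, description):
    """Resolve each abstract element through its path expression."""
    return {
        name: [node.text for node in entry.findall(path)]
        for name, path in description.items()
    }


abstract_entry = to_abstract(build_sample_entry(), DESCRIPTION)
print(abstract_entry)
```

Keeping the description as plain data means that supporting another dictionary only requires another mapping, not new code, which is the property the abstract structure is meant to provide.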
To describe a new dictionary, one has to write an XML file that associates CDM elements with pointers into the original dictionary structure.</Paragraph>
<Paragraph position="2"> As an example, the French-English-Malay FeM dictionary (Gut et al., 1996) has a specific structure, illustrated in figure 4.</Paragraph>
<Paragraph position="3"> Figure 5 gives the XML code associating elements of the FeM dictionary with elements of [...] define an XSL style sheet that is applied to the requested dictionary elements to produce the HTML code that defines the final form of the result. If no such style sheet is provided, the Papillon platform itself transforms the dictionary structure into a CDM structure (using the aforementioned description) and applies a generic style sheet to this structure.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Editing dictionary entries </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Presentation </SectionTitle>
<Paragraph position="0"> As the main purpose of the Papillon platform is to gather a community around the development of a dictionary, we also developed a service for editing dictionary entries.</Paragraph>
<Paragraph position="1"> [Figure: ... HTML interface] Any user who is registered and logged in to the Papillon web site may contribute to the Papillon dictionary by creating or editing an entry. Moreover, when a user asks for an unknown word, they are encouraged to contribute it to the dictionary.</Paragraph>
<Paragraph position="2"> Contributions are made through a standard HTML interface (see figure 6). This interface is rather crude and raises several problems. For instance, there is no way to copy and paste part of an existing entry into the editing window. Moreover, editing has to be done on-line [8]. 
However, as the interface uses only standard HTML elements with minimal JavaScript functionality, it may be used with any Internet browser on any platform (provided that the browser/platform correctly handles Unicode forms).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Strong points </SectionTitle>
<Paragraph position="0"> [...] on the server, but there is currently no specialized interface for off-line editing, meaning that users will have to use a standard text/XML editor for this.</Paragraph>
<Paragraph position="1"> From the beginning, we wanted this interface to be fully customizable by Papillon members without relying on the availability of a computer science specialist. Our reasons are:
* we wanted the structure of the Papillon dictionary to be adaptable as the project evolves, without requiring a full rewrite of the web site implementation;
* each language may slightly adapt the Papillon structure to fit its own needs (specific set of parts of speech, language levels, etc.), so adding a new dictionary implies adding a new custom interface.
We therefore chose to develop a system capable of generating a usable interface from a) a description of the dictionary structure (an XML Schema) and b) a description of the mapping between elements of the XML structure and standard HTML inputs.</Paragraph>
<Paragraph position="2"> For this, we used the ARTStudio tool described by Calvary et al. (2001). Using a tool that supports the development of plastic user interfaces allows us to generate not just one, but several interfaces for different devices. 
Hence, as we are now able to generate an HTML interface usable with any standard web browser supporting Unicode, we may, in the future, generate interfaces for Java applications (that can be used off-line) or interfaces for portable devices such as Pocket PCs or Palm computers.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.3 Implementation 4.3.1 Definition of the dictionary structure </SectionTitle>
<Paragraph position="0"> To provide an editing interface, the Papillon platform needs to know the exact dictionary structure. The structure has to be defined as a standard XML schema. We chose XML Schema because it allows for a finer description than DTDs (for instance, we may define the set of valid values for the textual content of an XML element). Moreover, XML Schema provides a simple inheritance mechanism that is useful for the definition of a dictionary. For instance, we defined a general structure for the Papillon dictionary (figure 7) and used the inheritance mechanism to refine this general structure for each language (as in figure 8).</Paragraph>
<Paragraph position="1"> [Figure: ... volumes of the Papillon dictionary, showing the part-of-speech element pos defined as a textual element.]</Paragraph>
<Paragraph position="2"> &lt;simpleType name=&quot;posType&quot;&gt;
  &lt;restriction base=&quot;d:posType&quot;&gt;
    &lt;enumeration value=&quot;n.m.&quot; /&gt;
    &lt;enumeration value=&quot;n.m. inv.&quot; /&gt;
    &lt;enumeration value=&quot;n.m. pl.&quot; /&gt;
    &lt;enumeration value=&quot;n.m., f.&quot; /&gt;
    &lt;enumeration value=&quot;n.f.&quot; /&gt;
    &lt;enumeration value=&quot;n.f. pl.&quot; /&gt;
    [...]
[Figure: ... of speech pos element in the Papillon French definition.]</Paragraph>
<Paragraph position="3"> Describing the interface is currently the most delicate required operation. The first step is to define the set of elements that will appear in the interface and their relation with the dictionary structure. Each such element is given a unique ID. 
This step defines an abstract interface in which all elements are known, but neither their layout nor their kind.</Paragraph>
<Paragraph position="4"> This step allows several different tasks to be defined for the editing of a single dictionary. The second step is to define the concrete realization and the position of all these elements. For instance, in this step, we specify that the pos element is to be rendered as a menu. Several kinds of widgets are defined by ARTStudio. Among them, we find simple HTML inputs such as text boxes, menus, check boxes, radio buttons and labels, but also several high-level elements such as generic lists of complex elements. As a simple example, we will see how the pos (part of speech) element is rendered in the Papillon interface. First, an interface element (called S.364) is related to the pos element (figure 9). Second, this element is realized in our interface as a comboBox (figure 10).</Paragraph>
<Paragraph position="5"> [Figure: ... element associated with the pos element. This element displays and edits values of type posType defined in the aforementioned schema.]</Paragraph>
<Paragraph position="6"> [Figure: ... the pos element.]</Paragraph>
<Paragraph position="7"> Using this technique is rather tricky, as there is currently no simple interface to generate these rather complex descriptions. However, keeping these descriptions separate allows several editing tasks to be defined (depending on the user profile) and also allows, for a single task, several concrete interfaces to be generated, depending on the device that will be used for editing (screen size, interaction methods, etc.).</Paragraph>
<Paragraph position="8"> Using the described structure of the dictionary, we are able to generate an empty dictionary entry containing all mandatory elements. Then, we walk this structure and instantiate all associated widgets (in our case HTML input elements), as defined in the interface description. 
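The two operations just described (building an empty entry, then walking it to instantiate widgets) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the platform's Java implementation: the `Node` and `Widget` classes, the `INTERFACE` mapping and the sample structure are hypothetical, and only the comboBox realization of pos comes from the text above. The sketch stops at widget records and does not serialise them to HTML.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the dictionary structure (hypothetical model)."""
    name: str
    mandatory: bool = True
    children: list = field(default_factory=list)


@dataclass
class Widget:
    """A concrete widget bound to one element of the entry."""
    kind: str
    path: str
    value: str = ""


# Hypothetical interface description: element name -> widget kind.
# Only the pos/comboBox pairing is taken from the text.
INTERFACE = {"headword": "textBox", "pos": "comboBox"}

# Hypothetical dictionary structure with one optional element.
STRUCTURE = Node("entry", children=[
    Node("headword"),
    Node("pos"),
    Node("example", mandatory=False),
])


def empty_entry(node):
    """Return an empty entry holding only the mandatory elements."""
    return {
        "name": node.name,
        "value": "",
        "children": [empty_entry(c) for c in node.children if c.mandatory],
    }


def instantiate_widgets(entry, interface, prefix=""):
    """Walk the entry depth-first and create a widget per mapped element."""
    path = f"{prefix}/{entry['name']}"
    widgets = []
    if entry["name"] in interface:
        widgets.append(Widget(interface[entry["name"]], path, entry["value"]))
    for child in entry["children"]:
        widgets.extend(instantiate_widgets(child, interface, path))
    return widgets


widgets = instantiate_widgets(empty_entry(STRUCTURE), INTERFACE)
for w in widgets:
    print(w.kind, w.path)
```

Binding each widget to the path of its element is one simple way to copy submitted values back into the right part of the entry when the form is validated.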
This way, we are able to generate the corresponding HTML form.</Paragraph>
<Paragraph position="9"> When the user validates a modification, the values of the HTML input elements are copied into the corresponding parts of the edited dictionary structure (this is also the case if the user asks for the addition or removal of an element in the structure). We are then able to regenerate the interface for the modified structure. We iterate this step until the user saves the modified structure.</Paragraph>
</Section>
</Section>
</Paper>