<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2215"> <Title>PolyphraZ : a tool for the management of parallel corpora</Title> <Section position="3" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 TraCorpEx and PolyphraZ </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Context </SectionTitle> <Paragraph position="0"> The TraCorpEx project has several contexts: the Papillon project (Papillon) for the cooperative construction of a large multilingual lexical base on the Web, the C-STAR III project (C-STAR III) on the translation of spoken dialogues, a French-Tunisian project (Hajlaoui, Boitet, 2003b), the UNL project (UNL) on multilingual communication and information systems, and the PhD research of the various participants in this project.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Current data and problems </SectionTitle> <Paragraph position="0"> We initially have 2 &quot;parallel&quot; corpora, structured differently.</Paragraph>
<Paragraph position="1"> * The BTEC corpus of C-STAR is made up of 5 sets of 163 files of 12K to 40K, each containing 1000 sentences, in English, Japanese (coded in EUC), Chinese and Korean, for a total of 6.1 MB per language. * The TANAKA corpus (Japanese-English), given to the Papillon project a few months before the death of its author in 2002, is made up of 45 files for a total of 18.4 MB. It contains sentences from newspapers and from NHK teaching materials for the learning of English by Japanese speakers. Each file is bilingual.</Paragraph>
<Paragraph position="2"> We also have corpora from the UNL project, where each document is a multilingual file containing, for each sentence, its text in the source language, a UNL graph, the results of deconversions into a certain number of languages, and possibly their revisions or direct manual translations.</Paragraph>
<Paragraph position="3"> All these &quot;parallel&quot; corpora are aligned at the level of sentences. As it would be interesting to show correspondences at finer levels (syntagms, chunks, words), we designed PolyphraZ so that tools for subsentential alignment, such as the one developed by Ch. Chenon for his Ph.D., can be added later.</Paragraph>
<Paragraph position="4"> In other corpora, we may be obliged to go up to the level of paragraphs, because sentences will not be aligned perfectly. That will not be handled completely within PolyphraZ, but at the level of the structure of the multilingual document itself: if 2 sentences are translated by 3, each of the 5 sentences will be placed in a different polyphrase, with its individual translations, and there will be another polyphrase, of &quot;n-m&quot; type, to contain the 2 complete segments.</Paragraph>
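To make this organisation concrete, here is a minimal sketch, in Java (the implementation language of PolyphraZ), of how such polyphrases might be modelled. The class and field names are illustrative assumptions only; the actual CPXM and MPM structures are defined by the DTDs discussed below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: a polyphrase groups, for each language, a monolingual
// component that may carry one or more proposals (candidate translations).
class Polyphrase {
    // e.g. "1-1" for a normally aligned sentence, "2-3" for an "n-m" block
    String alignmentType;
    // language code -> proposals in that language
    Map<String, List<String>> components = new HashMap<>();

    Polyphrase(String alignmentType) { this.alignmentType = alignmentType; }

    void addProposal(String lang, String text) {
        components.computeIfAbsent(lang, k -> new ArrayList<>()).add(text);
    }
}

public class PolyphraseDemo {
    public static void main(String[] args) {
        List<Polyphrase> memory = new ArrayList<>();
        // 2 source sentences translated by 3 target sentences:
        // each of the 5 sentences gets its own polyphrase ...
        for (int i = 0; i < 5; i++) {
            memory.add(new Polyphrase("1-1"));
        }
        // ... plus one extra polyphrase of "n-m" type holding the 2 complete segments.
        Polyphrase block = new Polyphrase("2-3");
        block.addProposal("fra", "Segment source complet (2 phrases).");
        block.addProposal("eng", "Complete target segment (3 sentences).");
        memory.add(block);
    }
}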
<Paragraph position="5"> The first problem we encounter with the available parallel corpora is that there is no tool to visualize their contents at a glance, sentence by sentence, nor to show the fine correspondences between subsentential segments. In addition, in the case of UNL documents, we cannot visualize at the same time a sentence in several languages and its corresponding UNL graph. Lastly, it is not possible to see successive versions in parallel.</Paragraph>
<Paragraph position="6"> When it comes to evaluation, we can only see the monolingual files and the associated statistical measurements (NIST, BLEU...), but we can never compare them with the actual translations and make a direct subjective evaluation.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Detailed objectives </SectionTitle> <Paragraph position="0"> The objectives of the TraCorpEx project are as follows.</Paragraph>
<Paragraph position="1"> 2.3.1 Construction of a software platform We want to build an environment that supports the import and export of parallel corpora, the preparation of the data for machine translators, postedition (HAMT), evaluation (various feedback methods), and finally the preparation of &quot;feedback&quot; for the developers of the MT systems used.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3.2 Addition of new languages </SectionTitle> <Paragraph position="0"> Starting from parallel corpora, we want to add one or more languages (those of the Papillon project for the Tanaka corpus, French and Arabic for the BTEC corpus).</Paragraph>
<Paragraph position="1"> We also want the same platform to make it possible to evaluate machine translators with automatic methods such as NIST, BLEU and PER, and to use this possibility in C-STAR to evaluate the Chinese-English and Japanese-English translations. Evaluating the results of the various MT systems will also enable us to determine &quot;the best&quot; (or least bad!) translation, which can then be proposed to a contributor as a starting point for revision.</Paragraph>
<Paragraph position="2"> We also want to test a hypothesis put forward by the second author: the quality of the translations could also be evaluated by computing distances between sentences and their reverse translations.</Paragraph>
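As an illustration of this hypothesis, the sketch below shows the kind of round-trip check it suggests; translateTo and distance are placeholders (a call to an external MT system and the combined edit distance of section 2.6), so this is an assumed outline rather than the code actually used.

// Hypothetical round-trip evaluation: translate a source sentence into the
// target language, translate the result back, and score the candidate by the
// distance between the original sentence and its reverse translation.
public class RoundTripEval {

    // Placeholder for a call to an external MT system (not a real API).
    static String translateTo(String sentence, String srcLang, String tgtLang) {
        return sentence; // stub
    }

    // Placeholder for the combined character/word edit distance of section 2.6.
    static double distance(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0; // stub
    }

    static double roundTripScore(String source, String srcLang, String tgtLang) {
        String candidate = translateTo(source, srcLang, tgtLang);
        String back = translateTo(candidate, tgtLang, srcLang);
        // A lower distance between the source and its reverse translation is
        // taken as an indication of better translation quality (the hypothesis to test).
        return distance(source, back);
    }

    public static void main(String[] args) {
        System.out.println(roundTripScore("Bonjour tout le monde.", "fra", "eng"));
    }
}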
<Paragraph position="3"> 2.3.4 Feedback to developers of MT systems We also want to give feedback to the developers of the systems used (unknown words, badly translated sentences...), as well as a comparative presentation of the various translation systems.</Paragraph>
<Paragraph position="4"> Taken together, the objectives of this project led us to propose interactive Web interfaces that allow us to choose, use, compare and publish machine translations for several language pairs, and to contribute to the improvement of the results by sending feedback to the developers of these systems.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 The PolyphraZ platform </SectionTitle> <Paragraph position="0"> PolyphraZ is a software platform that makes it possible both to visualize the available corpora on the Web, showing several languages chosen by the user, and to work on a base of &quot;polyphrases&quot; initialised from these corpora, while controlling all the functions described above (calls to MT systems, distance computation, collaborative postedition, evaluation).</Paragraph>
<Paragraph position="1"> We follow the software architecture of the Papillon platform.</Paragraph>
<Paragraph position="2"> We classify the objects to be handled into three types: * Raw corpus sources * Sources transformed into our XML format CXM (Common Example Markup) and coded in UTF-8, for visualization &quot;just as they are&quot;, then into the CPXM format, a DTD for parallel visualization.</Paragraph>
<Paragraph position="3"> * MPM: multilingual polyphrase memory We distinguish four principal users: the preparer, the reader (&quot;normal&quot; user), the posteditor and the manager.</Paragraph>
<Paragraph position="4"> * The preparer. His role consists in calling the translation systems and parameterizing them as well as possible, which requires a certain linguistic ability (to compare the results of various parameter settings, and of various segmentations into &quot;blocks&quot;, each corresponding to some parameter settings).</Paragraph>
<Paragraph position="5"> The preparer can also call objective evaluation methods (NIST, BLEU...) on the translation results, tune the parameters used to compute distances between sentences (translation results and/or reverse translations), and post the results. The distance computation produces, in addition to a value, an XML string from which a &quot;track changes&quot; presentation can be generated. The preparer can also set the parameters determining &quot;the best&quot; suggestion among the various translation candidates.</Paragraph>
<Paragraph position="6"> * The reader (normal user). A reader can visualize the data (the original, the various translations, and the distances between the character strings) through Web interfaces, but is not allowed to edit the translations.</Paragraph>
<Paragraph position="7"> * The translator-posteditor. The translator-posteditor is a contributor who translates from scratch or revises proposed translations (MT results, or translations of similar sentences found in the MPM or in other TMs put into CPXM or MPM format). There is an editable area for modifying the active sentence. One can also ask for global modifications (e.g. &quot;SVP&quot; changed into &quot;s'il vous plait&quot; in transcribed spoken utterances) and correct or supplement the local dictionary attached to the MPM. The system uses the reference sentences already produced as a translation memory.</Paragraph>
<Paragraph position="8"> PolyphraZ is thus also a translation assistance system, limited to the translation of sets of sentences (or titles), with fewer functionalities than commercial TWS (translation workstations), but usable for collaborative volunteer work by non-professionals. * The manager. The last type of user is the manager, who will produce from an MPM &quot;feedback&quot; for the developers of the MT systems used. A manager can himself be a developer of an MT system. He can draw up a list of unknown words and of words badly translated by each system (produced from the traces of the distance computations). A second function is to propose, for these words, translation suggestions drawn from the &quot;reference&quot; translations obtained after human revision. Finally, it is possible to provide a presentation of the evaluations and comparisons between the results of the various systems used and/or their various parameter settings.</Paragraph>
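As an example of the global modifications mentioned above for the translator-posteditor, the following sketch applies such a replacement over all stored proposals. The data layout and method names are a simplification introduced here for illustration, not the actual PolyphraZ code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a "global modification" as requested by a posteditor,
// e.g. expanding the abbreviation "SVP" into "s'il vous plait" in every
// French proposal stored in the memory.
public class GlobalModification {

    static void replaceEverywhere(List<List<String>> proposalsPerPolyphrase,
                                  String pattern, String replacement) {
        for (List<String> proposals : proposalsPerPolyphrase) {
            for (int i = 0; i < proposals.size(); i++) {
                proposals.set(i, proposals.get(i).replaceAll(pattern, replacement));
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> frenchProposals = new ArrayList<>();
        frenchProposals.add(new ArrayList<>(Arrays.asList("Asseyez-vous, SVP.")));
        frenchProposals.add(new ArrayList<>(Arrays.asList("SVP, parlez plus lentement.")));
        replaceEverywhere(frenchProposals, "\\bSVP\\b", "s'il vous plait");
        System.out.println(frenchProposals);
    }
}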
</Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4.3 Implementation of PolyphraZ </SectionTitle> <Paragraph position="0"> Programmed in standard Java under the Enhydra development environment used for the dynamic and multilingual Papillon web site, PolyphraZ is multi-platform (MacOS X/Unix/Linux, Windows).</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Scenarios </SectionTitle> <Paragraph position="0"> The use of PolyphraZ can be divided into 3 parts, corresponding to putting the data into three different formats (CXM, CPXM, MPM).</Paragraph>
<Paragraph position="1"> Figure 2: scenarios for using PolyphraZ. In order to manipulate a single format (XML) and a single encoding (UTF-8), we automatically convert the imported data (corpora, aligned texts...) into the CXM format. CXM is defined in the same spirit as the CDM (Common Dictionary Markup) of the Papillon project.</Paragraph>
<Paragraph position="3"> A second Java program transforms all the CXM files corresponding to a given multilingual parallel corpus of sentences into the CPXM format (see appendix 2). In this format, we introduce the &quot;polyphrase&quot; XML element, which is a set of monolingual components, each possibly containing one or more proposals.</Paragraph> </Section>
<Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5.3 MPM.dtd (Multilingual Polyphrase Memory) </SectionTitle> <Paragraph position="0"> The MPM data structure is under construction. It is intended to manage the correspondences between the various linguistic versions as well as the modifications that can be made to them, and to keep the history of the modified files. As shown in the following figure, an MPM of PolyphraZ can contain a set of versions and alternatives of the sentences, as well as the results of various computations.</Paragraph>
<Paragraph position="1"> Figure 4: logical view of an MPM. We give a first version of the MPM DTD in appendix 3.</Paragraph>
<Paragraph position="2"> PolyphraZ can visualize polyphrases in parallel from corpora in the CPXM or MPM formats. This functionality is useful for comparing translations, and is made available to readers, translator-revisors, and managers.</Paragraph> </Section>
<Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.6 Evaluation of translation results </SectionTitle> <Paragraph position="0"> We have programmed and integrated into PolyphraZ three evaluation methods (NIST, BLEU and distance calculation). NIST and BLEU are well known. Let us give more details about the distance calculation between 2 sentences.</Paragraph>
<Paragraph position="1"> The distance we compute between two strings is a linear combination of two edit distances, one at the level of characters, the other at the level of words. In general, the edit distance between two strings P1 and P2 of atoms (characters or words here) is the minimal number of suppressions, insertions or replacements of atoms necessary to transform P1 into P2 or, equivalently, P2 into P1. To compute the edit distance between P1 and P2 at the level of words, one segments them into words, computes the character-level distances between the words of P1 and the words of P2, and then computes the word-level distance using words as &quot;large characters&quot;. We use the well-known dynamic programming algorithm of (Wagner, Fischer, 1974). To combine the two levels (characters and words), we use a weighted (linear) combination of the two distances.
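A minimal sketch of this two-level distance is given below. The Wagner-Fischer recurrence is standard; however, the exact weighting of the two levels and the normalisation of the word-substitution cost are illustrative assumptions, not the precise formula used in PolyphraZ.

// Two-level edit distance in the spirit of section 2.6. The character level is
// the classical Wagner-Fischer (1974) dynamic programme; at the word level,
// words are treated as "large characters" and the substitution cost between two
// words is their normalised character distance. The final weights are illustrative.
public class TwoLevelDistance {

    // Classical edit distance between two character strings.
    static int charDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // suppression
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + subst);   // replacement
            }
        }
        return d[a.length()][b.length()];
    }

    // Edit distance at the word level, using words as "large characters": the
    // substitution cost of two words is their normalised character distance.
    static double wordDistance(String[] w1, String[] w2) {
        double[][] d = new double[w1.length + 1][w2.length + 1];
        for (int i = 0; i <= w1.length; i++) d[i][0] = i;
        for (int j = 0; j <= w2.length; j++) d[0][j] = j;
        for (int i = 1; i <= w1.length; i++) {
            for (int j = 1; j <= w2.length; j++) {
                double subst = (double) charDistance(w1[i - 1], w2[j - 1])
                        / Math.max(w1[i - 1].length(), w2[j - 1].length());
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + subst);
            }
        }
        return d[w1.length][w2.length];
    }

    // Assumed linear combination of the two levels (the weights are placeholders).
    static double combinedDistance(String s1, String s2, double alpha, double beta) {
        return alpha * charDistance(s1, s2)
                + beta * wordDistance(s1.split("\\s+"), s2.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(combinedDistance("il est aimable", "il est sympathique", 0.5, 0.5));
    }
}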
The trace of the distance computation can be displayed in a form corresponding to the presentation used by Microsoft Word in &quot;Track changes&quot; mode (Figure 7). It is very readable. In certain cases, the representation at the level of characters is more compact and readable than at the level of words, while in other cases it is the opposite. In fact, this representation is not &quot;faithful&quot; to the trace, because a sequence of exchanges is transformed into a sequence of suppressions followed by a sequence of insertions.</Paragraph>
<Paragraph position="2"> Figure 7: &quot;Track changes&quot; display. One interesting and still unsolved problem is how to merge the 2 levels: given 2 sentences and their character and word edit distances, both necessarily minimal, how can one produce a trace which would be &quot;the best&quot; or &quot;a best&quot; combination of the 2 traces? Figure 8: 3-line representation. This representation is simpler to understand, but takes more space.</Paragraph>
<Paragraph position="3"> In the XML representation of the trace, one symbol marks the exchange of one character for another, || marks the equality of two characters, and another symbol marks the suppression of a character. Figure 9: XML representation.</Paragraph> </Section> </Section> </Paper>