<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3120">
<Title>Adrià de Gispert</Title>
<Section position="3" start_page="0" end_page="143" type="metho">
<SectionTitle> 2 System Description </SectionTitle>
<Paragraph position="0"> This section describes the procedure followed to build the system from the data provided.</Paragraph>
<Section position="1" start_page="0" end_page="142" type="sub_section">
<SectionTitle> 2.1 Alignment </SectionTitle>
<Paragraph position="0"> Given a bilingual corpus, we use GIZA++ (Och, 2003) as the core word alignment algorithm. During word alignment, we use 50 classes per language, estimated by 'mkcls', a freely available tool distributed with GIZA++. Alignment is performed on lowercased text (which reduces the Alignment Error Rate), and the true case is recovered after the alignment is done.</Paragraph>
<Paragraph position="1"> In addition, for specific language pairs the alignment was improved using two strategies: Full verb forms. Verb morphology usually differs across languages, so it is useful to classify verbs in order to address the rich variety of verbal forms. Each verb is reduced to its base form and a reduced POS tag, as explained in de Gispert (2005). This transformation is applied only for alignment; its goal is to simplify the word alignment task and thereby improve its quality.</Paragraph>
<Paragraph position="2"> Block reordering (br). The difference in word order between two languages is one of the most significant sources of error in SMT. Related work deals either with reordering in general (Kanthak et al., 2005) or with local reordering (Tillmann and Ney, 2003). We apply a local reordering technique, implemented as a preprocessing stage, with two applications: (1) to improve alignment quality only, and (2) to improve alignment quality and to infer reordering for translation. Here we give a short explanation of the algorithm; for further details see Costa-jussà and Fonollosa (2006).</Paragraph>
<Paragraph position="3"> [Figure 1: a pair of consecutive blocks whose target translation is swapped.] This reordering strategy is intended to infer the most probable reordering for sequences of words, referred to as blocks, in order to monotonize the current alignments and to generalize the reordering to unseen pairs of blocks.</Paragraph>
<Paragraph position="4"> Given a word alignment, we identify those pairs of consecutive source blocks whose translation is swapped, i.e. those blocks which, if swapped, generate a correct monotone translation. Figure 1 shows an example of these pairs (hereinafter called Alignment Blocks).</Paragraph>
<Paragraph position="5"> Then, the list of Alignment Blocks (LAB) is processed to decide whether two consecutive blocks have to be reordered or not. Using the classification algorithm (see the Appendix), we divide the LAB into groups (Gn, n = 1 ... N). Within a group, new internal combinations are allowed in order to generalize the reordering to unseen pairs of blocks (i.e., new Alignment Blocks are created). Based on this information, the source side of the bilingual corpus is reordered.</Paragraph>
<Paragraph position="6"> When applying the reordering technique for purpose (1), we reorder only the source training corpus for realignment, and afterwards we restore the original order of the training corpus. When using Block Reordering for purpose (2), we reorder all the source corpora (both training and test) and use the new training corpus to realign and build the final translation system.</Paragraph>
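As an illustration of the swapping step described above, the following minimal Python sketch reorders a source sentence given a learned set of swappable consecutive blocks, restricted here to single-word blocks as in the shared-task configuration of Section 3.2. The swap table and function names are hypothetical stand-ins, not the authors' implementation.

# Hypothetical swap table standing in for the learned groups Gn of
# Alignment Blocks; each entry is a pair of consecutive source words
# whose target translation is swapped.
SWAP_PAIRS = {("casa", "blanca")}

def reorder_source(tokens, swap_pairs):
    """Swap consecutive source blocks so the translation becomes monotone."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if (out[i], out[i + 1]) in swap_pairs:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return out

# Example: "la casa blanca" -> "la blanca casa",
# monotone with the English "the white house".
print(reorder_source("la casa blanca".split(), SWAP_PAIRS))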
</Section>
<Section position="2" start_page="142" end_page="142" type="sub_section">
<SectionTitle> 2.2 Phrase Extraction </SectionTitle>
<Paragraph position="0"> Given a sentence pair and a corresponding word alignment, phrases are extracted following the criterion in Och and Ney (2004). A phrase (or bilingual phrase) is any pair of m source words and n target words that satisfies two basic constraints: the words are consecutive on both sides of the bilingual phrase, and no word on either side of the phrase is aligned to a word outside the phrase.</Paragraph>
<Paragraph position="1"> We limit the maximum size of any phrase to 7 words. Including longer phrases greatly increases computational and storage costs without providing a significant improvement in quality (Koehn et al., 2003), as the probability that longer phrases reappear decreases.</Paragraph>
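The extraction criterion above translates directly into code. The sketch below enumerates source spans up to length 7, finds the target span they align to, and keeps the pair only if no word inside either span is aligned outside the other; unlike the full criterion of Och and Ney (2004), it does not extend phrases over unaligned boundary words, so it is a simplification.

def extract_phrases(src, tgt, alignment, max_len=7):
    """alignment: set of (i, j) pairs, src position i aligned to tgt position j."""
    phrases = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to the source span [i1, i2]
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no target word in [j1, j2] aligned outside [i1, i2]
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            phrases.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return phrases

# Example with a swapped Noun + Adjective pair:
src = "la casa blanca".split()
tgt = "the white house".split()
alignment = {(0, 0), (1, 2), (2, 1)}
print(sorted(extract_phrases(src, tgt, alignment)))
# "la casa" / "casa" alone paired with a target span would violate
# consistency, so only ('la', 'the'), ('casa', 'house'), ('blanca', 'white'),
# ('casa blanca', 'white house') and the full sentence pair are extracted.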
</Section>
<Section position="3" start_page="142" end_page="142" type="sub_section">
<SectionTitle> 2.3 Feature functions </SectionTitle>
<Paragraph position="0"> Conditional and posterior probability (cp, pp). Given the collected phrase pairs, we estimate the phrase translation probability distribution by relative frequency in both directions.</Paragraph>
<Paragraph position="1"> The target language model (lm) consists of an n-gram model, in which the probability of a translation hypothesis is approximated by the product of word n-gram probabilities. As the default language model feature, we use a standard word-based 5-gram language model generated with Kneser-Ney smoothing and interpolation of higher- and lower-order n-grams (Stolcke, 2002).</Paragraph>
<Paragraph position="2"> The POS target language model (tpos) consists of an n-gram language model estimated over the same target side of the training corpus, but using POS tags instead of raw words.</Paragraph>
<Paragraph position="3"> The forward and backward lexicon models (ibm1, ibm1^-1) provide lexical translation probabilities for each phrase based on IBM model 1 word probabilities. For the forward lexicon model, IBM model 1 probabilities from GIZA++ source-to-target alignments are used; for the backward lexicon model, target-to-source alignments are used instead.</Paragraph>
<Paragraph position="4"> The word bonus model (wb) introduces a sentence length bonus in order to compensate for the system's preference for short output sentences.</Paragraph>
<Paragraph position="5"> The phrase bonus model (pb) introduces a constant bonus per produced phrase.</Paragraph>
</Section>
<Section position="4" start_page="142" end_page="143" type="sub_section">
<SectionTitle> 2.4 Decoding </SectionTitle>
<Paragraph position="0"> The search engine for this translation system is described in Crego et al. (2005); it takes into account the features described above.</Paragraph>
<Paragraph position="1"> Using reordering in the decoder (rgraph). A highly constrained reordering search is performed by means of a set of reordering patterns (linguistically motivated rewrite patterns), which are used to extend the monotone search graph with additional arcs. See the details in Crego et al. (2006).</Paragraph>
</Section>
<Section position="5" start_page="143" end_page="143" type="sub_section">
<SectionTitle> 2.5 Optimization </SectionTitle>
<Paragraph position="0"> Optimization is based on the simplex method (Nelder and Mead, 1965). This algorithm adjusts the log-linear weights in order to maximize a non-linear combination of the translation BLEU and NIST scores. The maximization is done over the provided development set for each of the six translation directions under consideration. We have observed an improvement in the coherence between all the automatic figures by integrating two of these figures into the optimization function.</Paragraph>
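A minimal sketch of this tuning loop, using SciPy's Nelder-Mead simplex implementation rather than the authors' own. Both decode_and_score() (here a smooth toy surrogate so the sketch runs end to end; a real system would decode the development set and score it) and the BLEU-by-NIST product are illustrative assumptions, not the exact combination used in the paper.

from scipy.optimize import minimize

FEATURES = ["cp", "pp", "lm", "ibm1", "ibm1inv", "wb", "pb"]

def decode_and_score(weights):
    """Hypothetical stand-in: decode the dev set with the given log-linear
    weights and return (BLEU, NIST). The toy surrogate below peaks when
    all weights equal 1.0."""
    penalty = sum((w - 1.0) ** 2 for w in weights)
    return 0.30 - 0.01 * penalty, 7.0 - 0.10 * penalty

def objective(weights):
    bleu, nist = decode_and_score(weights)
    return -(bleu * nist)  # negate: the simplex minimizes

result = minimize(objective, [0.5] * len(FEATURES), method="Nelder-Mead")
print(dict(zip(FEATURES, result.x)))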
</Section>
</Section>
<Section position="4" start_page="143" end_page="144" type="metho">
<SectionTitle> 3 Shared Task Results </SectionTitle>
<Section position="1" start_page="143" end_page="143" type="sub_section">
<SectionTitle> 3.1 Data </SectionTitle>
<Paragraph position="0"> The data provided for this shared task corresponds to a subset of the official transcriptions of the European Parliament Plenary Sessions, and it is available through the shared task website at http://www.statmt.org/wmt06/shared-task/.</Paragraph>
<Paragraph position="1"> The development set used to tune the system consists of a subset (the first 500 sentences) of the official development set made available for the Shared Task.</Paragraph>
<Paragraph position="2"> We carried out a morphological analysis of the data. English POS tagging was performed with the freely available TnT tagger (Brants, 2000). For Spanish, we used the FreeLing analysis tool (Carreras et al., 2004), which generates a POS tag for each input word.</Paragraph>
</Section>
<Section position="2" start_page="143" end_page="143" type="sub_section">
<SectionTitle> 3.2 Systems configurations </SectionTitle>
<Paragraph position="0"> The baseline system is the same for all tasks and includes the following feature functions: cp, pp, lm, ibm1, ibm1^-1, wb, pb. The POS target language model (tpos) has been used in those tasks for which a tagger was available. Table 1 shows the reordering configuration used for each task.</Paragraph>
<Paragraph position="1"> Block Reordering (application 2) has been used when the source language belongs to the Romance family. The block length is limited to 1 (i.e., only the swapping of single words is allowed). The main reason is that this solves specific errors in tasks from a Romance language into a Germanic language (such as the common Noun + Adjective order, which becomes Adjective + Noun). Although the Block Reordering approach does not depend on the task, we have not run the corresponding experiments to observe its efficiency on all the pairs used in this evaluation.</Paragraph>
<Paragraph position="2"> [Table 1: Reordering configuration used for each task: br1 (br2) stands for Block Reordering application 1 (application 2); rgraph refers to the reordering integrated in the decoder.]</Paragraph>
<Paragraph position="3"> The rgraph has been applied in those cases where we do not use br2 (there is no sense in applying both simultaneously) and where a tagger is available for the source language.</Paragraph>
<Paragraph position="4"> In the case of the GeEn pair, we have not experimented with any reordering; we leave the application of both reordering approaches as future work.</Paragraph>
</Section>
<Section position="3" start_page="143" end_page="144" type="sub_section">
<SectionTitle> 3.3 Discussion </SectionTitle>
<Paragraph position="0"> Table 2 presents the BLEU scores evaluated on the test set (using truecase) for each configuration. The official results were slightly better because a lowercased evaluation was used; see Koehn and Monz (2006).</Paragraph>
<Paragraph position="1"> For both the Es2En and Fr2En tasks, br helps slightly. The improvement of the approach depends on the quality of the alignment: better alignments allow higher-quality Alignment Blocks to be extracted (Costa-jussà and Fonollosa, 2006).</Paragraph>
<Paragraph position="2"> The En2Es task improves when adding both br1 and rgraph. Similarly, the En2Fr task seems to perform fairly well when using the rgraph. In this case, the improvement of the approach depends on the quality of the reordering patterns (Crego et al., 2006). However, this approach has the advantage of delaying the final reordering decision to the overall search, where all models are used to make a fully informed decision.</Paragraph>
<Paragraph position="3"> Finally, tpos does not help much when translating into English. This is not surprising, since it was used to improve gender and number agreement, which English does not require. However, in the direction into Spanish, tpos combined with the corresponding reordering helps more, as Spanish has gender and number agreement.</Paragraph>
<Paragraph position="4"> [Table 2: BLEU scores evaluated on the test set for each configuration: rc stands for Reordering Configuration and refers to Table 1. The bold results were the configurations submitted.]</Paragraph>
</Section>
</Section>
</Paper>