<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0824">
  <Title>RALI: SMT shared task system description</Title>
  <Section position="3" start_page="0" end_page="137" type="metho">
    <SectionTitle>
2 The core system
</SectionTitle>
    <Paragraph position="0"> We assembled up a phrase-based statistical engine by making use of freely available packages. The translation engine we used is the one suggested within the shared task: PHARAOH (Koehn, 2004).</Paragraph>
    <Paragraph position="1"> The input of this decoder is composed of a phrase-based model (PBM), a trigram language model and an optional set of coefficients and thresholds  500 top sentences of the DEV corpus in terms of WER (word error rate), SER (sentence error rate), NIST and BLEU scores.</Paragraph>
    <Paragraph position="2"> which control the decoder.</Paragraph>
    <Paragraph position="3"> For acquiring a PBM, we followed the approach described by Koehn et al. (2003). In brief, we relied on a bi-directional word alignment of the training corpus to acquire the parameters of the model. We used the word alignment produced by Giza (Och and Ney, 2000) out of an IBM model 2. We did try to use the alignment produced with IBM model 4, but did not notice significant differences over our experiments; an observation consistent with the findings of Koehn et al. (2003). Each parameter in a PBM can be scored in several ways. We considered its relative frequency as well as its IBM-model 1 score (where the transfer probabilities were taken from an IBM model 2 transfer table). The language model we used was the one provided within the shared task.</Paragraph>
    <Paragraph position="4"> We obtained baseline performances by tuning the engine on the top 500 sentences of the development corpus. Since we only had a few parameters to tune, we did it by sampling the parameter space uniformly. The best performance we obtained, i.e., the one which maximizes the BLEU metric as measured by the mteval script2 is reported for each pair of languages in Table 1.</Paragraph>
  </Section>
  <Section position="4" start_page="137" end_page="137" type="metho">
    <SectionTitle>
3 Smoothing PBMs with WORDNET
</SectionTitle>
    <Paragraph position="0"> Among the things we tried but which did not work well, we investigated whether smoothing the transfer table of an IBM model (2 in our case) with WORDNET would produce better estimates for rare words. We adapted an approach proposed by Cao et al. (2005) for an Information Retrieval task, and computed for any parameter (ei,fj) be-</Paragraph>
    <Paragraph position="2"> where E is the English vocabulary, pn designates the native distribution and pwn is the probability that two words in the English side are linked together. We estimated this distribution by co-occurrence counts over a large English corpus3.</Paragraph>
    <Paragraph position="3"> To avoid taking into account unrelated but co-occurring words, we used WORDNET to filter in only the co-occurrences of words that are in relation according to WORDNET. However, since many words are not listed in this resource, we had to smooth the bigram distribution, which we did by applying Katz smoothing (Katz, 1997):</Paragraph>
    <Paragraph position="5"> where .c(a,b|W,L) is the good-turing discounted count of times two words a and b that are linked together by a WORDNET relation, co-occur in a window of 2 sentences.</Paragraph>
    <Paragraph position="6"> We used this smoothed model to score the parameters of our PBM instead of the native transfer table. The results were however disappointing for both the G-E and S-E translation directions we tested. One reason for that, may be that the English corpus we used for computing the co-occurrence counts is an out-of-domain corpus for the present task. Another possible explanation lies in the fact that we considered both synonymic and hyperonymic links in WORDNET; the latter kind of links potentially introducing too much noise for a translation task.</Paragraph>
  </Section>
  <Section position="5" start_page="137" end_page="139" type="metho">
    <SectionTitle>
4 The German-English task
</SectionTitle>
    <Paragraph position="0"> We identified two major problems with our approach when faced with this pair of languages.</Paragraph>
    <Paragraph position="1"> First, the tendency in German to put verbs at the end of a phrase happens to ruin our phrase acquisition process, which basically collects any box of aligned source and target adjacent words. This 3For this, we used the English side of the provided training corpus plus the English side of our in-house Hansard bitext; that is, a total of more than 7 million pairs of sentences.  can be clearly seen in the alignment matrix of figure 1 where the verbal construction could clarify is translated by two very distant German words k&amp;quot;onnten and erl&amp;quot;autern. Second, there are many compound words in German that greatly dilute the various counts embedded in the PBM table.</Paragraph>
    <Paragraph position="2">  . . . . . . . . . . . . . x erl&amp;quot;autern . . . . . . . x . . . . .</Paragraph>
    <Paragraph position="3"> punkt . . . . . . . . . x . . .</Paragraph>
    <Paragraph position="4"> einen . . . . . . . . x . curlyleft . .</Paragraph>
    <Paragraph position="5"> mir . . . . . . . . . . . x .</Paragraph>
    <Paragraph position="6"> sie . . . . . x . . . . . . .</Paragraph>
    <Paragraph position="7"> oder . . . . x . . . . . . . .</Paragraph>
    <Paragraph position="8"> kommission . . . x . . . . . . . . .</Paragraph>
    <Paragraph position="9"> die . . x . . . . . . . . . .</Paragraph>
    <Paragraph position="10"> k&amp;quot;onnten . . . . . . x . . . . . .</Paragraph>
    <Paragraph position="11"> vielleicht . x . . . . . . . . . . .</Paragraph>
    <Paragraph position="13"> English perhaps the commission or you could clarify a point for me .</Paragraph>
    <Paragraph position="14"> German vielleicht k&amp;quot;onnten die kommission oder sie mir einen punkt erl&amp;quot;autern .</Paragraph>
    <Paragraph position="15">  in this matrix designates an alignment valid in both directions, while the curlyleft symbol indicates an uni-directional alignment (for has been aligned with einen, but not the other way round).</Paragraph>
    <Section position="1" start_page="138" end_page="138" type="sub_section">
      <SectionTitle>
4.1 Moving around German words
</SectionTitle>
      <Paragraph position="0"> For the first problem, we applied a memory-based approach to move around words in the German side in order to better synchronize word order in both languages. This involves, first, to learning transformation rules from the training corpus, second, transforming the German side of this corpus; then training a new translation model. The same set of rules is then applied to the German text to be translated.</Paragraph>
      <Paragraph position="1"> The transformation rules we learned concern a few (five in our case) verbal constructions that we expressed with regular expressions built on POS tags in the English side. Once the locus evu of a pattern has been identified, a rule is collected whenever the following conditions apply: for each word e in the locus, there is a target word f which is aligned to e in both alignment directions; these target words when moved can lead to a diagonal going from the target word (l) associated to eu[?]1 to the target word r which is aligned to ev+1.</Paragraph>
      <Paragraph position="2"> The rules we memorize are triplets (c,i,o) where c = (l,r) is the context of the locus and i and o are the input and output German word order (that is, the order in which the tokens are found, and the order in which they should be moved).</Paragraph>
      <Paragraph position="3"> For instance, in the example of Figure 1, the Verb Verb pattern match the locus could clarify and the following rule is acquired: (sie einen, k&amp;quot;onnten erl&amp;quot;autern, k&amp;quot;onnten erl&amp;quot;autern), a paraphrase of which is: &amp;quot;whenever you find (in this order) the word k&amp;quot;onnten and erl&amp;quot;autern in a German sentence containing also (in this order) sie and einen, move k&amp;quot;onnten and erl&amp;quot;autern between sie and einen.</Paragraph>
      <Paragraph position="4"> A set of 124 271 rules have been acquired this way from the training corpus (for a total of 157 970 occurrences). The most frequent rule acquired is (ich herrn, m&amp;quot;ochte danken, m&amp;quot;ochte danken), which will transform a sentence like &amp;quot;ich m&amp;quot;ochte herrn wynn f&amp;quot;ur seinen bericht danken.&amp;quot; into &amp;quot;ich m&amp;quot;ochte danken herrn wynn f&amp;quot;ur seinen bericht.&amp;quot;.</Paragraph>
      <Paragraph position="5"> In practice, since this acquisition process does not involve any generalization step, only a few rules learnt really fire when applied to the test material. Also, we devised a fairly conservative way of applying the rules, which means that in practice, only 3.5% of the sentences of the test corpus where actually modified.</Paragraph>
      <Paragraph position="6"> The performance of this procedure as measured on the development set is reported in Table 2. As simple as it is, this procedure yields a relative gain of 7% in BLEU. Given the crudeness of our approach, we consider this as an encouraging improvement. null</Paragraph>
    </Section>
    <Section position="2" start_page="138" end_page="139" type="sub_section">
      <SectionTitle>
4.2 Compound splitting
</SectionTitle>
      <Paragraph position="0"> For the second problem, we segmented German words before training the translation models. Empirical methods for compound splitting applied to  compound splitting approaches on the top 500 sentences of the development set.</Paragraph>
      <Paragraph position="1"> German have been studied by Koehn and Knight (2003). They found that a simple splitting strategy based on the frequency of German words was the most efficient method of the ones they tested, when embedded in a phrase-based translation engine. Therefore, we applied such a strategy to split German words in our corpora. The results of this approach are shown in Table 2.</Paragraph>
      <Paragraph position="2"> Note: Both the swapping strategy and the compound splitting yielded improvements in terms of BLEU score. Only after the deadline did we find time to train new models with a combination of both techniques; the results of which are reported in the last line of Table 2.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="139" end_page="139" type="metho">
    <SectionTitle>
5 The Finnish-English task
</SectionTitle>
    <Paragraph position="0"> The worst performances were registered on the Finnish-English pair. This is due to the agglutinative nature of Finnish. We tried to segment the Finnish material into smaller units (substrings) by making use of the frequency of all Finnish sub-strings found in the training corpus. We maintained a suffix tree structure for that purpose.</Paragraph>
    <Paragraph position="1"> We proceeded by recursively finding the most promising splitting points in each Finnish token of C characters FC1 by computing split(FC1 ) where:</Paragraph>
    <Paragraph position="3"> This approach yielded a significant degradation in performance that we still have to analyze.</Paragraph>
  </Section>
  <Section position="7" start_page="139" end_page="139" type="metho">
    <SectionTitle>
6 Submitted translations
</SectionTitle>
    <Paragraph position="0"> At the time of the deadline, the best translations we had were the baselines ones for all the language pairs, except for the German-English one where the moving of words ranked the best. This defined the configuration we submitted, whose results (as provided by the organizers) are reported in Table 3.</Paragraph>
    <Paragraph position="1">  the TEST corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>