<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1405">
  <Title>Improving a general-purpose Statistical Translation Engine by Terminological lexicons</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Our statistical engine
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The statistical models
</SectionTitle>
      <Paragraph position="0"> In this study, we built an SMT engine designed to translate from French to English, following the noisy-channel paradigm flrst described by (Brown et al., 1993b). This engine is based on equation 1, where e^I1 stands for the sequence of ^I English target words to be found, given a French source sentence of J words fJ1 :</Paragraph>
      <Paragraph position="2"> To train our statistical models, we assembled a bitext composed of 1.6 million pairs of sentences that were automatically aligned at the sentence level. In this experiment, every token was converted into lowercase before training.</Paragraph>
      <Paragraph position="3"> The language model we used is an interpolated trigram we trained on the English sentences of our bitext. The perplexity of the resulting model is fairly low { 65 {, which actually re ects the fact that this corpus contains many flxed expressions (e.g pursuant to standing order).</Paragraph>
      <Paragraph position="4"> The inverted translation model we used is an IBM2-like model: 10 iterations of IBM1training were run (reducing the perplexity of the training corpus from 7776 to 90), followed by 10 iterations of IBM2-training (yielding a flnal perplexity of 54). We further reduced the number of transfer parameters (originally 34969331) by applying an algorithm described in Foster (2000); this algorithm basically fllters in the pairs of words with the best gain, where gain is deflned as the difierence in perplexity | measured on a held-out corpus  |of a model trained with this pair of words and a model trained without. In this experiment, we worked with a model containing exactly the flrst gainranked million parameters. It is interesting to note that by doing this, we not only save memory, and therefore time, but also obtain improvments in terms of perplexity and overall perfor-</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The search algorithm
</SectionTitle>
      <Paragraph position="0"> The maximum operation in equation 1, also called search or decoding, involves a length model. We assume that the length (counted in words) of French sentences that translate an English sentence of a given length follow a normal distribution.</Paragraph>
      <Paragraph position="1"> We extended the decoder described by Nie...en et al. (1998) to a trigram language model. The basic idea of this search algorithm is to expand hypotheses along the positions of the target string while progressively covering the source ones. We refer the reader to the original paper for the recursion on which it relies, and instead give in Figure 1 a sketch of how a translation is built. An hypothesis h is fully determined by four parameters: its source (j) and target (i) positions of the last word (e), and its coverage (c). Therefore, the search space can be represented as a 4-dimension table, each item of which contains backtracking information (f for the fertility of e, bj and bw for the source position and the target word we should look at to backtrack) and the hypothesis score (prob).</Paragraph>
      <Paragraph position="2"> We know that better alignment models have been proposed and extensively compared (Och and Ney, 2000). We must however point out that the performance we obtained on the hansard corpus (see Section 3) is comparable to the rates published elsewhere on the same kind of corpus. In any case, our goal in this study is to compare the behavior of a SMT engine in both friendly and adverse situations. In our view, the present SMT engine is suitable for such a comparative study.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Tuning the decoder
</SectionTitle>
      <Paragraph position="0"> The decoder has been tuned in several ways in order to reduce its computations without detrimentally afiecting the quality of its output. The flrst thing we do when the decoder receives a sentence is to compute what we call an active vocabulary; that is, a collection of words which are likely to occur in the translation. This is done by ranking for each source word the target words according to their non normalized posterior likelihood (that is argmaxe p(fje)p(e), where p(e) is given by a unigram target language model, and p(fje) is given by the transfer Hansard sentences, we observed a reduction in word error rate of more than 3% with the reduced model.</Paragraph>
      <Paragraph position="1"> Input: f1 : : : fj : : : fJ Initialize the search space table Space Select a maximum target length: Imax Compute the active vocabulary // Fill the search table recursively: for all target position i = 1;2; : : : ; Imax do prune(i !1); for all alive hyp. h = Space(i; j; c; e) do</Paragraph>
      <Paragraph position="3"> for all w in bestWords do</Paragraph>
      <Paragraph position="5"> for all free source position d do s ^ prob; for all f 2 [1; fmax] = d + f ! 1 is free do</Paragraph>
      <Paragraph position="7"> // Find and return the best hypothesis if any maxs ^ !1 for all i 2 [1; Imax] do for all alive hyp. h = Space(i; j; c; e) do  s ^ Score(h) + log p(ijJ); if ((c == J) and (s &gt; maxs)) then maxs ^ s hmaxi; maxj; maxei ^ hi; j; ei if (maxs! = 1) then Return Space(maxi; maxj; J; maxe); else Failure Output: e1 : : : ei : : : emaxi  FreeSrcPositions returns the source positions not already associated to words of h; NBestTgtWords returns the list of words that are likely to follow the last bigram uv preceeding e according to the language model; and setIfBetter(i; j; c; e; p; f; bj; bw) is an operator that memorizes an hypothesis if its score (p) is greater than the hypothesis already stored in Space(i; j; c; e). a and t stands for the alignment and transfert distributions used by IBM2 models.</Paragraph>
      <Paragraph position="8"> probabilities of our inverted translation model) and keeping for each source word at most a target words.</Paragraph>
      <Paragraph position="9"> Increasing a raises the coverage of the active vocabulary, but also slows down the translation process and increases the risk of admitting a word that has nothing to do with the translation. We have conducted experiments with various a-values, and found that an a-value of 10 ofiers a good compromise.</Paragraph>
      <Paragraph position="10"> As mentioned in the block diagram, we also prune the space to make the search tractable. This is done with relative flltering as well as absolute thresholding. The details of all the flltering strategies we implemented are however not relevant to the present study.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Performance of our SMT engine
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Test corpora
</SectionTitle>
      <Paragraph position="0"> In this section we provide a comparison of the translation performances we measured on two corpora. The flrst one (namely, the hansard) is a collection of sentences extracted from a part of the Hansard corpus we did not use for training. In particular, we did not use any speciflc strategy to select these sentences so that they would be closely related to the ones that were used for training.</Paragraph>
      <Paragraph position="1"> Our second corpus (here called sniper) is an excerpt of an army manual on sniper training and deployment that was used in an EAR-LIER study (Macklovitch, 1995). This corpus is highly speciflc to the military domain and would certainly prove di-cult for any translation engine not speciflcally tuned to such material.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Overall performance
</SectionTitle>
      <Paragraph position="0"> In this section, we evaluate the performance of our engine in terms of sentence- and word- error rates according to an oracle translation2. The flrst rate is the percentage of sentences for which the decoder found the exact translation (that is, the one of our oracle), and the word error rate is computed by a Levenstein distance (counting the same penalty for both insertion, deletion and substitution edition operations). We realize that these measures alone are not su-cient for a serious evaluation, but we were re2Both corpora have been published in both French and English, and we took the English part as the gold standard.</Paragraph>
      <Paragraph position="1"> luctant in this experiment to resort to manual judgments, following for instance the protocol described in (Wang, 1998). Actually a quick look at the degradation in performance we observed on sniper is so clear that we feel these two rates are informative enough ! Table 1 summarizes the performance rates we measured. The WER is close to 60% on the hansard corpus and close to 74% on sniper; source sentences in the latter corpus being slightly longer on average (21 words). Not a single sentence was found to be identical to the gold standard translation on the sniper corpus  translator without any adjustments. jlengthj reports the average length (counted in words) of the source sentences and the standard deviation; nbs is the number of sentences in the corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Analyzing the performance drop
</SectionTitle>
      <Paragraph position="0"> As expected, the poor performance observed on the sniper text is mainly due to two reasons: the presence of out of vocabulary (OOV) words and the incorrect translations of terminological units.</Paragraph>
      <Paragraph position="1"> In the sniper corpus, 3.5% of the source tokens and 6.5% of the target ones are unknown to the statistical models. 44% of the source sentences and 77% of the target sentences contain at least one unknown word. In the hansard text, the OOV rates are much lower: around 0.5% of the source and target tokens are unknown and close to 5% of the source and target sentences contain at least one OOV words.</Paragraph>
      <Paragraph position="2"> These OOV rates have a clear impact on the coverage of our active vocabulary. On the sniper text, 72% of the oracle tokens are in the active vocabulary (only 0.5% of the target sentences are fully covered); whilst on hansard,  86% of the oracle's tokens are covered (24% of the target sentences are fully covered).</Paragraph>
      <Paragraph position="3"> Another source of disturbance is the presence of terminological units (TU) within the text to translate. Table 2 provides some examples of mistranslated TU from the sniper text. We also observed that many words within terminological units are not even known by the statistical models. Therefore accounting for terminology is one of the ways that should be considered to reduce the impact of OOV words.</Paragraph>
      <Paragraph position="4"> &lt; source term / oracle / translation&gt; &lt;^ame / bore / heart&gt; &lt;huile polyvalente / general purpose oil / oil polyvalente&gt; &lt;chambre / chamber / house of common&gt; &lt;tireur d' Pelite / sniper / issuer of elite&gt; &lt;la longueur de la crosse / butt length / the length of the crosse&gt;</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Integrating non-probabilistic terminological resources
</SectionTitle>
    <Paragraph position="0"> terminological resources Using terminological resources to improve the quality of an automatic translation engine is not at all a new idea. However, we know of very few studies that actually investigated this avenue in the fleld of statistical machine translation. Among them, (Brown et al., 1993a) have proposed a way to exploit bilingual dictionnaries at training time. There may also be cases where domain-speciflc corpora are available which allow for the training of specialized models that can be combined with the general ones.</Paragraph>
    <Paragraph position="1"> Another approach that would not require such material at training time consists in designing an adaptative translation engine. For instance, a cache-based language model could be used instead of our static trigram model.</Paragraph>
    <Paragraph position="2"> However, the design of a truly adaptative translation model remains a more speculative enterprise. At the very least, it would require a fairly precise location of errors in previously translated sentences; and we know from the AR-CADE campaign on bilingual alignments, that accurate word alignments are di-cult to obtain (VPeronis and Langlais, 2000). This may be even more di-cult in situations where errors will involve OOV words.</Paragraph>
    <Paragraph position="3"> We investigated a third option, which involves taking advantage { at run time { of existing terminological resources, such as Termium4. As mentioned by Langlais et al. (2001), one of a translator's flrst tasks is often terminological research; and many translation companies employ specialized terminologists. Actually, aside from the infrequent cases where, in a given thematic context, a word is likely to have a clearly preferred translation (e.g. bill/facture vs bill/projet de loi), lexicons are often the only means for a user to in uence the translation engine.</Paragraph>
    <Paragraph position="4"> Merging such lexicons at run time ofiers a complementary solution to those mentioned above and it should be a fruitful strategy in situations where terminological resources are not available at training time (which may often be the case). Unfortunately, integrating terminological (or user) lexicons into a probabilistic engine is not a straightforward operation, since we cannot expect them to come with attached probabilities. Several strategies do come to mind, however. For instance, we could credit a translation of a sentence that contains a source lexicon entry in cases it contains an authorized translation. But this strategy may prouve difflcult to tune since decoding usually involves many flltering strategies.</Paragraph>
    <Paragraph position="5"> The approach we adopted consists in viewing a terminological lexicon as a set of constraints that are employed to reduce the search space. For instance, knowing that sniper is a sanctioned translation of tireur d'Pelite, we may require that current hypotheses in the search space associate the target word sniper with the three source French words.</Paragraph>
    <Paragraph position="6"> In our implementation, we had to slightly modify the block diagram of Figure 1 in order to: 1) forbid a given word ei from being associated with a word belonging to a source terminological unit, if it is not sanctioned by the lexicon; and 2) add at any target position an hypothesis linking a target lexicon entry to its source counterpart. Whether these hypotheses will survive intact will depend on constraints imposed by the maximum operation (of equation 1) over the full translation.</Paragraph>
    <Paragraph position="7"> The score associated with a target entry ei0i</Paragraph>
    <Paragraph position="9"> The rationale behind this equation is that both the language (p) and the alignment (a) models have some information that can help to decide the appropriateness of an extension: the former knows how likely it is that a word (known or not) will follow the current history5; and the latter knows to some extent where the target unit should be (regardless of its identity). In the absence of a better mechanism (e.g. a cache-model should be worth a try) We hope that this will be su-cient to determine the flnal position of the target unit in a given hypothesis.</Paragraph>
  </Section>
class="xml-element"></Paper>