<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1063">
  <Title>Multi-Engine Machine Translation with Voted Language Model</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Notes on Evaluation
</SectionTitle>
    <Paragraph position="0"> We assume here that the MEMT works on a sentence-by-sentence basis. That is, it takes as input a source sentence, gets it translated by several OTSs, and picks up the best among translations it gets. Now a problem with using BLEU in this setup is that translations often end up with zero because model translations they refer to do not contain n-grams of a particular length.2 This would make impossible a comparison and selection among possible translations.</Paragraph>
    <Paragraph position="1"> 2In their validity study of BLEU, Reeder and White (2003) finds that its correlation with human judgments increases with the corpus size, and warns that to get a reliable score for BLEU, one should run it on a corpus of at least 4,000 words. Also Tate et al. (2003) reports about some correlation between BLEU and task based judgments.</Paragraph>
    <Paragraph position="2"> One way out of this, Nomoto (2003) suggests, is to back off to a somewhat imprecise yet robust metric for evaluating translations, which he calls mprecision.3 The idea of m-precision helps define what an optimal MEMT should look like. Imagine a system which operates by choosing, among candidates, a translation that gives a best m-precision. We would reasonably expect the system to outperform any of its component OTSs. Indeed Nomoto (2003) demonstrates empirically that it is the case.</Paragraph>
    <Paragraph position="3"> Moreover, since rFLMps and rALMps work on a sentence, not on a block of them, what h(*) relates to is not BLEU, but m-precision.</Paragraph>
    <Paragraph position="4"> Hogan and Frederking (1998) introduces a new kind of yardstick for measuring the effectiveness of MEMT systems. The rationale for this is that it is often the case that the efficacy of MEMT systems does not translate into performance of outputs that they generate. We recall that with BLEU, one measures performance of translations, not how often a given MEMT system picks the best translation among candidates. The problem is, even if a MEMT is right about its choices more often than a best component engine, BLEU may not show it. This happens because a best translation may not always get a high score in BLEU. Indeed, differences in BLEU among candidate translations could be very small.</Paragraph>
    <Paragraph position="5"> Now what Hogan and Frederking (1998) suggest is the following.</Paragraph>
    <Paragraph position="7"> where d(i,j) is the Kronecker delta function, which gives 1 if i = j and 0 otherwise. Here psm represents some MEMT system, psm(e) denotes a particular translation psm chooses for sentence e, i.e., psm(e) = Ps(e,J,l). se1 ...seM [?] J denotes a set of candidate translations. max here gives a translation with the highest score in m-precision. N is the number of source sentences. d(*) says that you get</Paragraph>
    <Paragraph position="9"> which is nothing more than Papineni et al. (2002)'s modified n-gram precision applied to a pair of a single reference and the associated translation. Sit here denotes a set of i-grams in t, v an i-gram. C(v,t) indicates the count of v in t. Nomoto (2003) finds that m-precision strongly correlates with BLEU, which justifies the use of m-precision as a replacement of BLEU at the sentence level.</Paragraph>
    <Paragraph position="10"> didates. d(psm) gives the average ratio of the times psm hits a right translation. Let us call d(psm) HF accuracy (HFA) for the rest of the paper.</Paragraph>
    <Paragraph position="11"> 4 LM perplexity and MEMT performance Now the question we are interested in asking is whether the choice of LM really matters. That is, does a particular choice of LM gives a better performing FLMps or ALMps than something else, and if it does, do we have a systematic way of choosing one LM over another? Let us start with the first question. As a way of shedding some light on the issue, we ran FLMps and ALMps using a variety of LMs, derived from various domains with varying amount of training data. We worked with 24 LMs from various genres, with vocabulary of size ranging from somewhere near 10K to 20K in words (see below and also Appendix A for details on train sets). LMs here are trigram based and created using an open source speech recognition tool called JULIUS.4 Now train data for LMs are collected from five corpora, which we refer to as CPC, EJP, PAT, LIT, NIKMAI for the sake of convenience. CPC is a huge set of semi-automatically aligned pairs of English and Japanese texts from a Japanese news paper which contains as many as 150,000 sentences (Utiyama and Isahara, 2002), EJP represents a relatively small parallel corpus of English/Japanese phrases (totaling 15,187) for letter writing in business (Takubo and Hashimoto, 1999), PAT is a bilingual corpus of 336,971 abstracts from Japanese patents filed in 1995, with associated translations in English (a.k.a NTCIR-3 PATENT).5 LIT contains 100 Japanese literary works from the early 20th century, and NIKMAI 1,536,191 sentences compiled from several Japanese news paper sources. Both LIT and NIKMAI are monolingual.</Paragraph>
    <Paragraph position="12"> Fig.1 gives a plot of HF accuracy by perplexity for FLMps's on test sets pulled out of PAT, EJP and CPC.6 Each dot there represents an FLMps with a particular LM plugged into it. The HFA of each  sentences, that from PAT contains 4,600 bilingual abstracts (approximately 9,200 sentences). None of them overlaps with the remaining part of the corresponding data set. Relevant LMs are built on Japanese data drawn from the data sets. We took care not to train LMs on test sets. (See Section 6 for further details.)  split 10 blocks of a test set. The perplexity is that of Pl(j) averaged over blocks, with a particular LM plugged in for l (see Equation 1).</Paragraph>
    <Paragraph position="13"> We can see there an apparent tendency for an LM with lower perplexity to give rise to an FLMps with higher HFA, indicating that the choice of LM does indeed influence the performance of FLMps. Which is somewhat surprising given that the perplexity of a machine generated translation should be independent of how similar it is to a model translation, which dictates the HFA.7 Now let us turn to the question of whether there is any systematic way of choosing an LM so that it gives rise to a FLMps with high HFA. Since we are working with multiple OTS systems here, we get multiple outputs for a source text. Our idea is to let them vote for an LM to plug into FLMps or for that matter, any other forms of MEMT discussed earlier. Note that we could take an alternate approach of letting a model (or human) translation (associated with a source text) pick an LM by alone.</Paragraph>
    <Paragraph position="14"> An obvious problem with this approach, however, is that a mandatory reference to model translations would compromise the robustness of the approach.</Paragraph>
    <Paragraph position="15"> We would want the LM to work for MEMT regardless of whether model translations are available. So our concern here is more with choosing an LM in the absence of model translations, to which we will return below.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Voting Language Model
</SectionTitle>
    <Paragraph position="0"> We consider here a simple voting scheme `a la ROVER (Fiscus, 1997; Schwenk and Gauvain, 2000; Utsuro et al., 2003), which works by picking 7Recall that the HFA does not represent the confidence score such as one given by FLM (Equation 1), but the average ratio of the times that an MEMT based on FLM picks a translation with the best m-precision.</Paragraph>
    <Paragraph position="1">  M. S represents a set of OTS systems, L a set of language models. th is some confidence model such (r)FLM or (r)ALM. V-by-M chooses a most-votedfor LM among those in L, given the set J of translations for e.</Paragraph>
    <Paragraph position="3"> up an LM voted for by the majority. More specifically, for each output translation for a given input, we first pick up an LM which gives it the smallest perplexity, and out of those LMs, one picked by the majority of translations will be plugged into MEMT.</Paragraph>
    <Paragraph position="4"> We call the selection scheme voting-by-majority or simply V-by-M. The V-by-M scheme is motivated by the results in Fig.1, where perplexity is found to be a reasonably good predictor of HFA.</Paragraph>
    <Paragraph position="5"> Formally, we could put the V-by-M scheme as follows. For each of the translation outputs je1 ...jen associated with a given input sentence e, we want to</Paragraph>
    <Paragraph position="7"> Now assume M1 ...Mn are such LMs for je1 ...jen.</Paragraph>
    <Paragraph position="8"> Then we pick up an M with the largest frequency and plug it into th such as FLM.8 Suppose, for instance, that Ma, Mb, Ma and Mc are lowest perplexity LMs found for translations je1,je2,je3 and je4, respectively. Then we choose Ma as an LM most voted for, because it gets two votes from je1 and je3, meaning that Ma is nominated as an LM with lowest perplexity by je1 and je3, while Mb and Mc each collect only one vote. In case of ties, we randomly choose one of the LMs with the largest count of votes.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiment Setup and Procedure
</SectionTitle>
    <Paragraph position="0"> Let us describe the setup of experiments we have conducted. The goal here is to learn how the V-by-M affects the overall MEMT performance. For test sets, we carry over those from the perplexity experiments (see Footnote 6, Section 4), which are derived from CPC, EJP, and PAT. (Call them tCPC, tEJP, and tPAT hereafter.) In experiments, we begin by splitting a test set into equal-sized blocks, each containing 500 sentences for tEJP and tCPC, and 100 abstracts (approximately 200 sentences) for tPAT.9 We had the total of 15 blocks for tCPC and tEJP, and 46 blocks for tPAT. We leave one for evaluation and use the rest for training alignment models, i.e., Q(e  |j), SV regressors and some inside-data LMs. (Again we took care not to inadvertently train LMs on test sets.) We send a test block to OTSs Ai, Lo, At, and Ib, for translation and combine their outputs using the V-by-M scheme, which may or may not be coupled with regression SVMs. Recall that the MEMT operates on a sentence by sentence basis. So what happens here is that for each of the sentences in a block, the MEMT works the four MT systems to get translations and picks one that produces the best score under th.</Paragraph>
    <Paragraph position="1"> We evaluate the MEMT performance by running HFA and BLEU on MEMT selected translations block by block,10 and giving average performance over the blocks. Table 1 provides algorithmic details on how the MEMT actually operates.</Paragraph>
    <Paragraph position="2"> 8It is worth noting that the voted language model readily lends itself to a mixture model: P(j) =Pm[?]M lmP(j  |m) where lm = 1 if m is most voted for and 0 otherwise.</Paragraph>
    <Paragraph position="3"> 9tCPC had the average of 15,478 words per block, whereas tEJP had about 11,964 words on the average in each block. With tPAT, however, the average per block word length grew to 16,150.</Paragraph>
    <Paragraph position="4"> 10We evaluate performance by block, because of some reports in the MT literature that warn that BLEU behaves erratically on a small set of sentences (Reeder and White, 2003). See also Section 3 and Footnote 2 for the relevant discussion.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> Now let us see what we found from the experiments.</Paragraph>
    <Paragraph position="1"> We ran the MEMT on a test set with (r)FLM or (r)ALM embedded in it. Recall that our goal here is to find how the V-by-M affects performance of MEMT on tCPC, tEJP, and tPAT.</Paragraph>
    <Paragraph position="2"> First, we look at whether the V-by-M affects in any way, the HFA of the MEMT, and if it does, then how much. Table 2 and Table 3 give summaries of results on HFA versus V-by-M. Table 2 shows how things are with V-by-M on, and Table 3 shows what happens to HFA when we turn off V-by-M, that is, when we randomly choose an LM from the same set that the V-by-M chooses from. The results indicate a clear drop in performance of FLMps and ALMps when one chooses an LM randomly.11 Curiously, however, rFLMps and rALMps are affected less. They remain roughly at the same level of HFA over Table 2 and Table 3. What this means 11Another interesting question to ask at this point is, how does one huge LM trained across domains compare to the V-by-M here? By definition of perplexity, the increase in size of the training data leads to an increase in perplexity of the LM. So if general observations in Fig.1 hold, then we would expect the &amp;quot;one-huge-LM&amp;quot; approach to perform poorly compared to the V-by-M, which is indeed demonstrated by the following results. HFLMps below denotes a FLMps based on a composite LM trained over CPC, LIT, PAT, NIKMAI, and EJP. The testing procedure is same as that described in Sec.6  is that there is some discrepancy in the effectiveness of V-by-M between the fluency based and regression based models. We have no explanation for the cause of the discrepancy at this time, though we may suspect that in learning, as long as there is some pattern to exploit in m-precision and the probability estimates of test sentences, how accurate those estimates are may not matter much.</Paragraph>
    <Paragraph position="3"> Table 4 and Table 5 give results in BLEU.12 The results tend to replicate what we found with HFA.</Paragraph>
    <Paragraph position="4"> rFLMps and rALMps keep the edge over FLMps and ALMps whether or not V-by-M is brought into action. The differences in performance between rFLMps and rALMps with or without the V-by-M scheme are rather negligible. However, if we turn to FLMps and ALMps, the effects of the V-by-M are clearly visible. FLMps scores 0.2107 when coupled with the V-by-M. However, when disengaged, the score slips to 0.1946. The same holds for ALMps.</Paragraph>
    <Paragraph position="5">  Leaving the issue of MEMT models momentarily, let us see how the OTS systems Ai, Lo, At, and Ib are doing on tCPC, tEJP, and tPAT. Note that the whole business of MEMT would collapse if it slips behind any of the OTS systems that compose it.</Paragraph>
    <Paragraph position="6"> Table 6 and Table 7 show performance of the four OTS systems plus OPM, by HFA and by BLEU.</Paragraph>
    <Paragraph position="7"> OPM here denotes an oracle MEMT which operates by choosing in hindsight a translation that gives the best score in m-precision, among those produced by OTSs. It serves as a practical upper bound for MEMT while OTSs serve as baselines.</Paragraph>
    <Paragraph position="8"> First, let us look at Table 6 and compare it to Table 2. A good news is that most of the OTS systems do not even come close to the MEMT models. At, a best performing OTS system, gets 0.4643 on the average, which is about 20% less than that scored by rFLMps. Turning to BLEU, we find again in Table 7 that a best performing system among the OTSs, i.e., Ai, is outperformed by FLMps, ALMps and all their varieties (Table 4). Also something of note here is that on tPAT, (r)FLMps and (r)ALMps in Table 4, which operate by the V-by-M scheme, score somewhere from 0.1907 to 0.1954 in BLEU, coming close to OPM, which scores 0.1995 on tPAT (Table 7).</Paragraph>
    <Paragraph position="9"> It is interesting to note, incidentally, that there is some discrepancy between BLEU and HFA in performance of the OTSs: A top performing OTS in Table 6, namely At, achieves the average HFA of 0.4643, but scores only 0.1738 for BLEU (Table 7), which is worse than what Ai gets. Apparently, high HFA does not always mean a high BLEU score.</Paragraph>
    <Paragraph position="10"> Why? The reason is that a best MT output need not mark a high BLEU score. Notice that 'best' here means the best among translations by the OTSs. It could happen that a poor translation still gets chosen as best, because other translations are far worse.</Paragraph>
    <Paragraph position="11"> To return to the discussion of (r)FLMps and (r)ALMps, an obvious fact about their behavior is that regressor based systems rFLMps and rALMps, whether V-by-M enabled or not, surpass in performance their less sophisticated counterparts (see  the MEMT models to correct themselves for some domain-specific bias of the OTS systems. But the downside of using regression to capitalize on their bias is that you may need to be careful about data you train a regressor on.</Paragraph>
    <Paragraph position="12"> Here is what we mean. We ran experiments using SVM regressors trained on a set of data randomly sampled from tCPC, tEJP, and tPAT. (In contrast, rFLMps and rALMps in earlier experiments had a regressor trained separately on each data set.) They all operated in the V-by-M mode. The results are shown in Table 8 and Table 9. What we find there is that with regressors trained on perturbed data, both rFLMps and rALMps are not performing as well as before; in fact they even fall behind FLMps and ALMps in HFA and their performance in BLEU turns out to be just about as good as FLMps and ALMps.</Paragraph>
    <Paragraph position="13"> So regression may backfire when trained on wrong data.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Conclusion
</SectionTitle>
    <Paragraph position="0"> Let us summarize what we have done and learned from the work. We started with a finding that the choice of language model could affect performance of MEMT models of which it is part. The V-by-M was introduced as a way of responding to the problem of how to choose among LMs so that we get the best MEMT. We have shown that the V-by-M scheme is indeed up to the task, predicting a right LM most of the time. Also worth mentioning is that the MEMT models here, when coupled with V-by-M, are all found to surpass component OTS systems by a respectable margin (cf., Tables 4, 7 for BLEU, 2, 6 for HFA).</Paragraph>
    <Paragraph position="1"> Regressive MEMTs such as rFLMps and rALMps, are found to be not affected as much by the choice of LM as their non-regressive counterparts. We suspect this happens because they have access to extra information on the quality of translation derived from human judgments or translations, which may cloud effects of LMs on them. But we also pointed out that regressive models work well only when they are trained on right data; if you train them across different sources of varying genres, they could fail. An interesting question that remains to be addressed is how we might deal with translations from a novel domain. One possible approach would be to use a dynamic language model which adapts itself for a new domain by re-training itself on data sampled from the Web (Berger and Miller, 1998).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML