<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1004"> <Title>Segment Choice Models: Feature-Rich Models for Global Distortion in Statistical Machine Translation</Title> <Section position="4" start_page="25" end_page="26" type="intro"> <SectionTitle> 2 Disperp and Distortion Corpora </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.1 Defining Disperp </SectionTitle> <Paragraph position="0"> The ultimate reason for choosing one SCM over another will be the performance of an MT system containing it, as measured by a metric like BLEU (Papineni et al., 2002). However, training and testing a large-scale MT system for each new SCM would be costly. Also, the distortion component's effect on the total score is muffled by other components (e.g., the phrase translation and target language models). Can we devise a quick standalone metric for comparing SCMs? There is an offline metric for statistical language models: perplexity (Jelinek, 1990). By analogy, the higher the overall probability a given SCM assigns to a test corpus of representative distorted sentence hypotheses (DSHs), the better the quality of the SCM. To define distortion perplexity (&quot;disperp&quot;), let PrM(dk) = the probability an SCM M assigns to a DSH for sentence k, dk. If T is a test corpus comprising numerous DSHs, the probability of the corpus according to M is PrM(T) = Π_k PrM(dk).</Paragraph> <Paragraph position="1"> Let S(T) = total number of segments in T. Then disperp(M,T) = PrM(T)^(-1/S(T)).
This gives the mean number of choices model M allows; the lower the disperp for corpus T, the better M is as a model for T (a model X that predicts segment choice in T perfectly would have disperp(X,T) = 1.0).</Paragraph> </Section> <Section position="2" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 2.2 Some Simple A Priori SCMs </SectionTitle> <Paragraph position="0"> The uniform SCM assigns to the DSH dk that has S(dk) segments the probability 1/[S(dk)!]. We call this Model A. Let's define some other illustrative SCMs. Fig. 2 shows a sentence that has 7 segments with 10 words (numbered 0-9 by original order).</Paragraph> <Paragraph position="1"> Three segments in the source have been used; the decoder has a choice of four RS. Which of the RS has the highest probability of being chosen? Perhaps [2 3], because it is the leftmost RS: the &quot;leftmost&quot; predictor. Or, the last phrase in the DSH will be followed by the phrase that originally followed it, [8 9]: the &quot;following&quot; predictor. Or, perhaps positions in the source and target should be close, so since the next DSH position to be filled is 4, phrase [4] should be favoured: the &quot;parallel&quot; predictor.

Figure 2. Segment choice prediction example:
Original German: [ich] [habe] [das buch] [gelesen] [.]
DSH for German: [ich] [habe] [gelesen] [das buch] [.]
(English: [i] [have] [read] [the book] [.])
original: [0 1] [2 3] [4] [5] [6] [7] [8 9]
DSH: [0 1] [5] [7]; RS: [2 3], [4], [6], [8 9]

Model B will be based on the &quot;leftmost&quot; predictor, giving the leftmost segment in the RS twice the probability of the other segments, and giving the others uniform probabilities. Model C will be based on the &quot;following&quot; predictor, doubling the probability for the segment in the RS whose first word was the closest to the last word in the DSH, and otherwise assigning uniform probabilities.
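(As a minimal sketch, not code from the paper, with the Figure 2 state hard-coded: each predictor picks out its favoured RS segment as follows.)

```python
# Decoder state from Figure 2: three segments placed, four remaining.
dsh = [[0, 1], [5], [7]]            # segments already placed in the DSH
rs = [[2, 3], [4], [6], [8, 9]]     # remaining segments (RS), in source order

# "leftmost" predictor: the RS segment starting earliest in the source.
leftmost = min(rs, key=lambda seg: seg[0])

# "following" predictor: the RS segment that originally followed the
# last phrase placed in the DSH.
last_word = dsh[-1][-1]
following = min(rs, key=lambda seg: abs(seg[0] - (last_word + 1)))

# "parallel" predictor: the RS segment whose source start is closest to
# the next target position to be filled (words already placed).
next_target = sum(len(seg) for seg in dsh)
parallel = min(rs, key=lambda seg: abs(seg[0] - next_target))

print(leftmost, following, parallel)   # [2, 3] [8, 9] [4]
```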
Finally, Model D combines &quot;leftmost&quot; and &quot;following&quot;: where the leftmost and following segments are different, both are assigned double the uniform probability; if they are the same segment, that segment has four times the uniform probability. Of course, the factor of 2.0 in these models is arbitrary. For Figure 2, the probabilities would be: Model A gives each of the four RS segments 1/4; Model B gives 2/5 to [2 3] and 1/5 to each other segment; Model C gives 2/5 to [8 9] and 1/5 to each other segment; Model D gives 1/3 each to [2 3] and [8 9], and 1/6 each to [4] and [6]. Now let's define an SCM derived from the distortion penalty used by systems based on the &quot;following&quot; predictor, as in (Koehn, 2004). Let ai = start position of the source phrase translated into the ith target phrase, and bi-1 = end position of the source phrase translated into the (i-1)th target phrase. Then the distortion penalty d(ai, bi-1) = α^|ai - bi-1 - 1|; the total distortion is the product of the phrase distortion penalties. This penalty is applied as a kind of non-normalized probability in the decoder. The value of α for given (source, target) languages is optimized on development data.</Paragraph> <Paragraph position="2"> To turn this penalty into an SCM, penalties are normalized into probabilities at each decoding stage; we call the result Model P (for &quot;penalty&quot;). Model P with α = 1.0 is the same as the uniform Model A. In disperp experiments, Model P with α optimized on held-out data performs better than Models A-D (see Figure 5), suggesting that disperp is a realistic measure.</Paragraph> <Paragraph position="3"> Models A-D are models whose parameters were all defined a priori; Model P has one trainable parameter, α. Next, let's explore distortion models with several trainable parameters.</Paragraph> </Section> <Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 2.3 Constructing a Distortion Corpus </SectionTitle> <Paragraph position="0"> To compare SCMs using disperp and to train complex SCMs, we need a corpus of representative examples of DSHs. There are several ways of obtaining such a corpus.
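(For instance, as an illustrative sketch rather than the pipeline we actually used: once a DSH is available with its segments' original source positions, replaying it step by step yields one (DSH-so-far, RS, chosen-segment) example per decoding step.)

```python
# Sketch: turn one DSH (source segments listed in target order) into
# segment-choice training examples by replaying the decoder's choices.
def choice_examples(dsh_segments):
    remaining = sorted(dsh_segments)   # RS starts as all segments, in source order
    examples = []
    history = []
    for seg in dsh_segments:
        # Record the state (placed so far, remaining) and the choice made.
        examples.append((list(history), list(remaining), seg))
        remaining.remove(seg)
        history.append(seg)
    return examples

# Toy DSH from Figure 2's German example, as source-position spans:
# original [ich] [habe] [das buch] [gelesen] [.] is [0] [1] [2 3] [4] [5];
# the DSH reorders it to [0] [1] [4] [2 3] [5].
for hist, rs, chosen in choice_examples([[0], [1], [4], [2, 3], [5]]):
    print(rs, "choose", chosen)
```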
For the experiments described here, the MT system was first trained on a bilingual sentence-aligned corpus. Then, the system was run in a second pass over its own training corpus, using its phrase table with the standard distortion penalty to obtain a best-fit phrase alignment between each (source, target) sentence pair. Each such alignment yields a DSH whose segments are aligned with their original positions in the source; we call such a source-DSH alignment a &quot;segment alignment&quot;. We now use a leave-one-out procedure to ensure that information derived from a given sentence pair is not used to segment-align that sentence pair. In our initial experiments we didn't do this, with the result that the segment-aligned corpus underrepresented the case where words or N-grams not in the phrase table are seen in the source sentence during decoding.</Paragraph> </Section> </Section> </Paper>