<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1032">
  <Title>Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases</Title>
  <Section position="4" start_page="255" end_page="257" type="metho">
    <SectionTitle>
3 Scaling to Long Phrases
</SectionTitle>
    <Paragraph position="0"> Table 1 gives statistics about the Arabic-English parallel corpus used in the NIST large data track. The corpus contains 3.75 million sentence pairs, and has 127 million words in English, and 106 million words in Arabic. The table shows the number of unique Arabic phrases, and gives the average number of translations into English and their average length.</Paragraph>
    <Paragraph position="1"> Table 2 gives estimates of the size of the lookup tables needed to store phrases of various lengths, based on the statistics in Table 1. The number of unique entries is calculated as the number of unique phrases times the average number of translations. The number of words in the table is calculated as the number of unique phrases times the phrase length plus the number of entries times the average translation length. The memory is calculated assuming that each word is represented with a 4-byte integer, that each entry stores its probability as an 8-byte double, and that each word alignment is stored as a 2-byte short. Note that the size of the table will vary depending on the phrase extraction technique.</Paragraph>
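    <Paragraph> The arithmetic behind such an estimate can be sketched as follows. This is a minimal illustration, not the calculation used to produce Table 2: the function name, the parameter alignment_points_per_entry (the text does not say how many alignment points an entry stores) and the input values are all assumptions.

```python
def phrase_table_bytes(unique_phrases, phrase_length, avg_translations,
                       avg_translation_length, alignment_points_per_entry):
    """Estimate phrase-table memory (in bytes) for phrases of one length."""
    # Entries: every unique source phrase paired with each of its translations.
    entries = unique_phrases * avg_translations
    # Words: source-side words plus target-side words over all entries.
    words = unique_phrases * phrase_length + entries * avg_translation_length
    word_bytes = 4 * words      # each word is a 4-byte integer
    prob_bytes = 8 * entries    # one 8-byte double per entry
    # Assumption: a fixed number of 2-byte alignment values per entry.
    align_bytes = 2 * entries * alignment_points_per_entry
    return word_bytes + prob_bytes + align_bytes


# Hypothetical inputs, not figures taken from Table 1.
print(phrase_table_bytes(unique_phrases=1000, phrase_length=3,
                         avg_translations=2, avg_translation_length=4,
                         alignment_points_per_entry=3))  # prints 72000
```

    Because the entry count multiplies every term, the estimate grows quickly as longer phrases add more unique entries.</Paragraph>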
    <Paragraph position="2"> Table 3 gives the percentage of the 35,313-word test set that can be covered using only phrases of the specified length or greater, and so shows the efficacy of using phrases of different lengths. While the rate of falloff is rapid, there are still multiple matches of phrases of length 10.</Paragraph>
    <Paragraph position="3"> The longest matching phrase was one of length 18.</Paragraph>
    <Paragraph position="4"> There is little generalization in current SMT implementations, and consequently longer phrases generally lead to better translation quality.</Paragraph>
    <Section position="1" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
3.1 Why use phrases?
</SectionTitle>
      <Paragraph position="0"> Statistical machine translation made considerable advances in translation quality with the introduction of phrase-based translation. By increasing the size of the basic unit of translation, phrase-based machine translation does away with many of the problems associated with the original word-based formulation of statistical machine translation (Brown et al., 1993), in particular: * The Brown et al. (1993) formulation doesn't have a direct way of translating phrases; instead they specify a fertility parameter which is used to replicate words and translate them individually. * With units as small as words, a lot of reordering has to happen between languages with different word orders. But the distortion parameter is a poor explanation of word order.</Paragraph>
      <Paragraph position="1"> Phrase-based SMT overcomes the first of these problems by eliminating the fertility parameter and directly handling word-to-phrase and phrase-to-phrase mappings. The second problem is alleviated through the use of multi-word units which reduce the dependency on the distortion parameter. Less word reordering need occur since local dependencies are frequently captured. For example, common adjective-noun alternations are memorized. However, since this linguistic information is not encoded in the model, unseen adjective-noun pairs may still be handled incorrectly.</Paragraph>
      <Paragraph position="2"> By increasing the length of phrases beyond a few words, we might hope to capture additional non-local linguistic phenomena. For example, by memorizing longer phrases we may correctly learn case information for nouns commonly selected by frequently occurring verbs; we may properly handle discontinuous phrases (such as French negation, some German verb forms, and English verb particle constructions) that are neglected by current phrase-based models; and we may by chance capture some agreement information in coordinated structures.</Paragraph>
    </Section>
    <Section position="2" start_page="256" end_page="257" type="sub_section">
      <SectionTitle>
3.2 Deciding what length of phrase to store
</SectionTitle>
      <Paragraph position="0"> Despite the potential gains from memorizing longer phrases, the fact remains that as phrases get longer there is a decreasing likelihood that they will be repeated. Because of the amount of memory required to store a phrase table, in current implementations a choice is made as to the maximum length of phrase to store.</Paragraph>
      <Paragraph position="1"> Based on their analysis of the relationship between translation quality and phrase length, Koehn et al. (2003) suggest limiting phrase length to three words or fewer. This is entirely a practical suggestion for keeping the phrase table to a reasonable size, since they measure minor but incremental improvement in translation quality up to their maximum tested phrase length of seven words.[1] Table 4 gives statistics about phrases which occur more than once in the English section of the Europarl corpus (Koehn, 2002), which was used in the Koehn et al. (2003) experiments. It shows that the percentage of words in the corpus that can be covered by repeated phrases falls off rapidly at length 6, but that even phrases up to length 10 are able to cover a non-trivial portion of the corpus. This calls into question the desirability of limiting phrase retrieval to length three.</Paragraph>
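      <Paragraph> One way to measure coverage by repeated phrases is sketched below. It is only an assumed reading of the statistic (a word counts as covered if it lies inside at least one n-gram that occurs more than once); the function name is hypothetical and the procedure behind Table 4 may differ.

```python
from collections import Counter


def repeated_phrase_coverage(corpus, n):
    """Fraction of word positions inside some n-gram that occurs more than once."""
    if not corpus:
        return 0.0
    last_start = len(corpus) - n + 1
    # Count every n-gram in the corpus.
    grams = Counter(tuple(corpus[i:i + n]) for i in range(last_start))
    covered = [False] * len(corpus)
    for i in range(last_start):
        if grams[tuple(corpus[i:i + n])] > 1:
            # Mark all n positions spanned by this repeated n-gram.
            for p in range(i, i + n):
                covered[p] = True
    return sum(covered) / len(corpus)


corpus = "spain declined to confirm that spain declined to aid morocco".split()
print(repeated_phrase_coverage(corpus, 2))  # prints 0.6
print(repeated_phrase_coverage(corpus, 4))  # prints 0.0
```

      On this toy corpus the only repeated material is "spain declined to", so coverage drops to zero once n exceeds three.</Paragraph>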
      <Paragraph position="2"> The decision concerning what length of phrases to store in the phrase table seems to boil down to a practical consideration: one must weigh the likelihood of retrieval against the memory needed to store longer phrases. We present a data structure where this is not a consideration. Our suffix array-based data structure allows the retrieval of arbitrarily long phrases, while simultaneously requiring far less memory than the standard table-based representation. [1] While the improvements to translation quality reported in Koehn et al. (2003) are minor, their evaluation metric may not have been especially sensitive to adding longer phrases. They used the Bleu evaluation metric (Papineni et al., 2002), but capped the n-gram precision at 4-grams.</Paragraph>
      <Paragraph position="3"> [Figure 1: a suffix array initialized for the one-sentence corpus "spain declined to confirm that spain declined to aid morocco"; entry i refers to the suffix beginning at word i, from the full sentence down to the single word "morocco".]</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="257" end_page="259" type="metho">
    <SectionTitle>
4 Suffix Arrays
</SectionTitle>
    <Paragraph position="0"> The suffix array data structure (Manber and Myers, 1990) was introduced as a space-economical way of creating an index for string searches. The suffix array data structure makes it convenient to compute the frequency and location of any substring or n-gram in a large corpus. Abstractly, a suffix array is an alphabetically sorted list of all suffixes in a corpus, where a suffix is a substring running from each position in the text to the end. However, rather than actually storing all suffixes, a suffix array can be constructed by creating a list of references to each of the suffixes in a corpus. Figure 1 shows how a suffix array is initialized for a corpus with one sentence. Each index of a word in the corpus has a corresponding place in the suffix array, which is identical in length to the corpus. Figure 2 shows the final state of the suffix array, which is a list of the indices of words in the corpus that corresponds to an alphabetically sorted list of the suffixes.</Paragraph>
    <Paragraph position="1"> The advantages of this representation are that it is compact and easily searchable. The total size of the suffix array is a constant amount of memory. Typically it is stored as an array of integers where the array is the same length as the corpus. Because it is organized alphabetically, any phrase can be quickly located within it using a binary search algorithm.</Paragraph>
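    <Paragraph> The construction and lookup just described can be sketched for the one-sentence corpus of Figure 1. This is an illustrative sketch with hypothetical function names: it sorts full suffixes naively, which is fine for a toy corpus but is not how a suffix array over a large corpus would be built.

```python
def build_suffix_array(corpus):
    """Return word positions ordered by the suffix that starts at each one."""
    # Naive construction: compares whole suffixes, so only for small inputs.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def find_span(corpus, suffix_array, phrase):
    """Return (first, last) suffix-array indices of suffixes that begin with
    phrase, or None if the phrase does not occur."""
    n = len(phrase)

    def prefix(idx):
        pos = suffix_array[idx]
        return corpus[pos:pos + n]

    # First binary search: leftmost suffix whose prefix is not below phrase.
    lo, hi = 0, len(suffix_array)
    while hi > lo:
        mid = (lo + hi) // 2
        if phrase > prefix(mid):
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second binary search: leftmost suffix whose prefix is above phrase.
    hi = len(suffix_array)
    while hi > lo:
        mid = (lo + hi) // 2
        if prefix(mid) > phrase:
            hi = mid
        else:
            lo = mid + 1
    if lo == first:
        return None
    return first, lo - 1


corpus = "spain declined to confirm that spain declined to aid morocco".split()
sa = build_suffix_array(corpus)
print(sa)                                              # [8, 3, 6, 1, 9, 5, 0, 4, 7, 2]
print(find_span(corpus, sa, ["spain", "declined"]))    # (5, 6)
```

    The width of the returned span, here 6 - 5 + 1 = 2, is the number of times the phrase occurs in the corpus.</Paragraph>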
    <Paragraph position="2"> Yamamoto and Church (2001) show how to use suffix arrays to calculate a number of statistics that are interesting in natural language processing applications. They demonstrate how to calculate term frequency / inverse document frequency (tf / idf) for all n-grams in very large corpora, as well as how to use these frequencies to calculate n-grams with high mutual information and residual inverse document frequency. Here we show how to apply suffix arrays to parallel corpora to calculate phrase translation probabilities. [Figure 2: the sorted suffix array for the one-sentence corpus of Figure 1.]</Paragraph>
    <Section position="1" start_page="257" end_page="258" type="sub_section">
      <SectionTitle>
4.1 Applied to parallel corpora
</SectionTitle>
      <Paragraph position="0"> In order to adapt suffix arrays to be useful for statistical machine translation we need a data structure with the following elements: * A suffix array created from the source language portion of the corpus, and another created from the target language portion of the corpus, * An index that tells us the correspondence between sentence numbers and positions in the source and target language corpora, * An alignment $a$ for each sentence pair in the parallel corpus, where $a$ is defined as a subset of the Cartesian product of the word positions in a sentence $e$ of length $I$ and a sentence $f$ of length $J$: $a \subseteq \{(i,j) : i = 1 \ldots I;\ j = 1 \ldots J\}$ * A method for extracting the translationally equivalent phrase for a subphrase given an aligned sentence pair containing that subphrase. The total memory usage of the data structure is thus the size of the source and target corpora, plus the size of the suffix arrays (identical in length to the corpora), plus the size of the two indexes that correlate sentence positions with word positions, plus the size of the alignments. Assuming we use ints to represent words and indices, and shorts to represent word alignments, we get the following memory</Paragraph>
    </Section>
    <Section position="2" start_page="258" end_page="259" type="sub_section">
      <SectionTitle>
4.2 Calculating phrase translation probabilities
</SectionTitle>
      <Paragraph position="0"> In order to produce a set of phrase translation probabilities, we need to examine the ways in which they are calculated. We consider two common ways of calculating the translation probability: using the maximum likelihood estimator (MLE) and smoothing the MLE using lexical weighting.</Paragraph>
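      <Paragraph position="1"> The maximum likelihood estimator for the probability of a phrase is defined as $$p(\bar{f}|\bar{e}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}} \mathrm{count}(\bar{f}, \bar{e})} \qquad (1)$$</Paragraph>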
      <Paragraph position="3"> Where $\mathrm{count}(\bar{f}, \bar{e})$ gives the total number of times the phrase $\bar{f}$ was aligned with the phrase $\bar{e}$ in the parallel corpus. We define phrase alignments as follows. A substring $\bar{e}$ consisting of the words at positions $l \ldots m$ is aligned with the phrase $\bar{f}$ by way of the subalignment $s = a \cap \{(i,j) : i = l \ldots m,\ j = 1 \ldots J\}$. The aligned phrase $\bar{f}$ is the subphrase in $f$ which spans from $\min(j)$ to $\max(j)$ for $j \mid (i,j) \in s$.</Paragraph>
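      <Paragraph> The phrase alignment definition above can be sketched in a few lines. The function name and the placeholder target tokens are hypothetical, and the sketch uses 0-based positions where the definition uses 1-based ones.

```python
def extract_aligned_phrase(f_sentence, alignment, l, m):
    """Return the part of f aligned to e positions l..m (inclusive), or None."""
    # Subalignment s: the alignment points whose e-side index lies in l..m.
    s = [(i, j) for (i, j) in alignment if i in range(l, m + 1)]
    if not s:
        return None  # no word of the phrase is aligned in this sentence pair
    js = [j for (_, j) in s]
    # The aligned phrase spans from the smallest to the largest f-side index.
    return f_sentence[min(js):max(js) + 1]


# e = "spain declined to aid morocco"; the f tokens are placeholders.
f = ["F0", "F1", "F2", "F3"]
a = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)}
print(extract_aligned_phrase(f, a, 1, 2))  # ['F1']
print(extract_aligned_phrase(f, a, 3, 4))  # ['F2', 'F3']
```

      Taking the minimum and maximum target index means the returned span is always contiguous, even when the aligned points themselves are not.</Paragraph>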
      <Paragraph position="4"> The procedure for generating the counts that are used to calculate the MLE probability using our suffix array-based data structures is:  1. Locate all the suffixes in the English suffix array which begin with the phrase $\bar{e}$. Since the suffix array is sorted alphabetically we can easily find the first occurrence $s[k]$ and the last occurrence $s[l]$. The length of the span in the suffix array, $l - k + 1$, indicates the number of occurrences of $\bar{e}$ in the corpus. Thus the denominator $\sum_{\bar{f}} \mathrm{count}(\bar{f}, \bar{e})$ can be calculated as $l - k + 1$.</Paragraph>
      <Paragraph position="5"> 2. For each of the matching phrases s[i] in the span s[k]...s[l], look up the value of s[i] which is the word index w of the suffix in the English corpus. Look up the sentence number that includes w, and retrieve the corresponding sentences e and f, and their alignment a.</Paragraph>
      <Paragraph position="6"> 3. Use $a$ to extract the target phrase $\bar{f}$ that aligns with the phrase $\bar{e}$ that we are searching for. Increment the count for $\langle \bar{f}, \bar{e} \rangle$.</Paragraph>
      <Paragraph position="7"> 4. Calculate the probability for each unique matching phrase $\bar{f}$ using the formula in Equation 1.</Paragraph>
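      <Paragraph> The four steps can be put together in a compact sketch. Everything named here is an assumption made for illustration: the function name, the 0-based alignment pairs, the placeholder target tokens, and the decision to skip occurrences that are unaligned or that run across a sentence boundary. Step 1 is written as a linear scan to keep the sketch short, where the procedure itself relies on binary search.

```python
from bisect import bisect_right
from collections import Counter


def mle_translation_probs(e_sents, f_sents, alignments, phrase):
    """Return {target phrase: probability} for one English phrase."""
    # Flatten the English side and remember where each sentence starts.
    corpus, starts = [], []
    for sent in e_sents:
        starts.append(len(corpus))
        corpus.extend(sent)
    n = len(phrase)
    # Naive suffix array; a real system would build and store it once.
    sa = sorted(range(len(corpus)), key=lambda p: corpus[p:])

    # Step 1: matching suffixes are contiguous in sa (found by scan here).
    span = [p for p in sa if corpus[p:p + n] == phrase]

    counts = Counter()
    for w in span:
        # Step 2: map corpus position w to its sentence pair and alignment.
        sent_no = bisect_right(starts, w) - 1
        l = w - starts[sent_no]
        m = l + n - 1
        if m >= len(e_sents[sent_no]):
            continue  # the match runs across a sentence boundary
        # Step 3: the aligned target phrase spans min(j)..max(j).
        js = [j for (i, j) in alignments[sent_no] if i in range(l, m + 1)]
        if not js:
            continue  # unaligned occurrence
        counts[tuple(f_sents[sent_no][min(js):max(js) + 1])] += 1

    # Step 4: normalise the counts into relative frequencies.
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}


e_sents = [["spain", "declined", "to", "aid", "morocco"],
           ["spain", "declined", "to", "confirm"]]
f_sents = [["F_SPAIN", "F_DECLINED", "F_AID", "F_MOROCCO"],   # placeholders
           ["F_SPAIN", "F_REFUSED", "F_CONFIRM"]]
alignments = [{(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)},
              {(0, 0), (1, 1), (2, 2), (3, 2)}]
print(mle_translation_probs(e_sents, f_sents, alignments, ["spain", "declined"]))
```

      Normalising by the summed counts matches the denominator of the estimator; when every occurrence yields an aligned phrase, that sum equals the span length found in step 1.</Paragraph>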
      <Paragraph position="8">  A common alternative formulation of the phrase translation probability is to lexically weight it as follows:  Where $n$ is the length of $\bar{e}$.</Paragraph>
      <Paragraph position="9"> In order to use lexical weighting we would need to repeat steps 1-4 above for each word $e_i$ in $\bar{e}$. This would give us the values for $p(f_j|e_i)$. We would further need to retain the subphrase alignment $s$ in order to know the correspondence between the words $(i,j) \in s$ in the aligned phrases, and the total number of foreign words that each $e_i$ is aligned with ($|\{j \mid (i,j) \in s\}|$). Since a phrase alignment $\langle \bar{f}, \bar{e} \rangle$ may have multiple possible word-level alignments, we retain a set of alignments $S$ and take the maximum:</Paragraph>
      <Paragraph position="11"> Thus our suffix array-based data structure can be used straightforwardly to look up all aligned translations for a given phrase and calculate the probabilities on-the-fly. In the next section we turn to the computational complexity of constructing phrase translation probabilities in this way.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="259" end_page="260" type="metho">
    <SectionTitle>
5 Computational Complexity
</SectionTitle>
    <Paragraph position="0"> Computational complexity is relevant because there is a speed-memory tradeoff when adopting our data structure. What we gained in memory efficiency may be rendered useless if the time it takes to calculate phrase translation probabilities is unreasonably long. Looking up items in a hash table, as is done in current table-based data structures, is extremely fast: a single phrase can be looked up in unit time, O(1).</Paragraph>
    <Paragraph position="1"> The computational complexity of our method has the following components: * The complexity of finding all occurrences of the phrase in the suffix array * The complexity of retrieving the associated aligned sentence pairs given the positions of the phrase in the corpus * The complexity of extracting all aligned phrases using our phrase extraction algorithm * The complexity of calculating the probabilities given the aligned phrases The methods we use to execute each of these, and their complexities, are as follows: * Since the array is sorted, finding all occurrences of the English phrase is extremely fast. We can do two binary searches: one to find the first occurrence of the phrase and a second to find the last. The computational complexity is therefore bounded by $O(2\log(n))$ where $n$ is the length of the corpus.</Paragraph>
    <Paragraph position="2"> * We use a similar method to look up the sentences $e_i$ and $f_i$ and word-level alignment $a_i$ that are associated with the position $w_i$ in the corpus of each phrase occurrence $\bar{e}_i$. The complexity is $O(k \cdot 2\log(m))$ where $k$ is the number of occurrences of $\bar{e}$ and $m$ is the number of sentence pairs in the parallel corpus. [Table 5: example lookup times for phrases of different frequencies; columns: phrase, freq, O, time (ms).]</Paragraph>
    <Paragraph position="3"> * The complexity of extracting the aligned phrase for a single occurrence of $\bar{e}_i$ is $O(2\log(|a_i|))$ to get the subphrase alignment $s_i$, since we store the alignments in a sorted array. The complexity of then getting $\bar{f}_i$ from $s_i$ is $O(\mathrm{length}(\bar{f}_i))$. * The complexity of summing over all aligned phrases and simultaneously calculating their probabilities is $O(k)$.</Paragraph>
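    <Paragraph position="4"> Thus we have a total complexity of: $$O\Big(2\log(n) + k \cdot 2\log(m) + \sum_{i=1}^{k} \big(2\log(|a_i|) + \mathrm{length}(\bar{f}_i)\big) + k\Big)$$</Paragraph>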
    <Paragraph position="6"> for the MLE estimation of the translation probabilities for a single phrase. The complexity is dominated by the k terms in the equation, when the number of occurrences of the phrase in the corpus is high. Phrases with high frequency may cause excessively long retrieval time. This problem is exacerbated when we shift to a lexically weighted calculation of the phrase translation probability. The complexity will be multiplied across each of the component words in the phrase, and the component words themselves will be more frequent than the phrase.</Paragraph>
    <Paragraph position="7"> Table 5 shows example times for calculating the translation probabilities for a number of phrases. For frequent phrases like "of the" these times get unacceptably long. While our data structure is perfect for overcoming the problems associated with storing the translations of long, infrequently occurring phrases, it in a way introduces the converse problem. It has a clear disadvantage in the amount of time it takes to retrieve commonly occurring phrases. In the next section we examine the use of sampling to speed up the calculation of translation probabilities for very frequent phrases.</Paragraph>
  </Section>
  <Section position="7" start_page="260" end_page="260" type="metho">
    <SectionTitle>
6 Sampling
</SectionTitle>
    <Paragraph position="0"> Rather than compute the phrase translation probabilities by examining the hundreds of thousands of occurrences of common phrases, we instead sample from a small subset of the occurrences. It is unlikely that we need to extract the translations of all occurrences of a high frequency phrase in order to get a good approximation of their probabilities.</Paragraph>
    <Paragraph position="1"> We instead cap the number of occurrences that we consider, and thus give a maximum bound on k in Equation 5.</Paragraph>
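    <Paragraph> Capping the number of occurrences can be sketched as below. How the occurrences are chosen is not specified here, so uniform random sampling without replacement is an assumption, as are the function name and the fixed seed (used only to make the sketch repeatable).

```python
import random


def sample_occurrences(span, max_samples, seed=0):
    """Return at most max_samples corpus positions drawn from span."""
    positions = list(span)
    if max_samples >= len(positions):
        return positions  # rare phrase: keep every occurrence
    # Assumption: uniform sampling without replacement.
    rng = random.Random(seed)
    return rng.sample(positions, max_samples)


# A phrase with 500,000 occurrences is reduced to 100 lookups.
subset = sample_occurrences(range(500000), 100)
print(len(subset))  # prints 100
```

    Only the sampled positions then go through sentence lookup and phrase extraction, so the per-occurrence work is bounded by the cap rather than by the phrase frequency.</Paragraph>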
    <Paragraph position="2"> In order to determine the effect of different levels of sampling, we compare the translation quality against cumulative retrieval time for calculating the phrase translation probabilities for all subphrases in an evaluation set. We translated a held out set of 430 German sentences with 50 words or less into English. The test sentences were drawn from the 01/17/00 proceedings of the Europarl corpus. The remainder of the corpus (1 million sentences) was used as training data to calculate the phrase translation probabilities. We calculated the translation quality using Bleu's modified n-gram precision metric (Papineni et al., 2002) for n-grams of up to length four. The framework that we used to calculate the translation probabilities was similar to that detailed in Koehn et al. (2003). That is:</Paragraph>
    <Paragraph position="4"> Where pLM is a language model probability and d is a distortion probability which penalizes movement.</Paragraph>
    <Paragraph position="5"> Table 6 gives a comparison of the translation quality under different levels of sampling. While the accuracy fluctuates very slightly, it essentially remains uniformly high for all levels of sampling. There are a number of possible reasons for the fact that the quality does not decrease: * The probability estimates under sampling are sufficiently good that the most probable translations remain unchanged, * The interaction with the language model probability rules out the few misestimated probabilities, or * The decoder tends to select longer or less frequent phrases which are not affected by the sampling.</Paragraph>
    <Paragraph position="6"> While the translation quality remains essentially unchanged, the cumulative time that it takes to calculate the translation probabilities for all subphrases in the 430 sentence test set decreases radically. The total time drops by orders of magnitude from an hour and a half without sampling down to a mere 10 seconds with a cavalier amount of sampling. This suggests that the data structure is suitable for deployed SMT systems and that no additional caching need be done to compensate for the structure's computational complexity.</Paragraph>
  </Section>
</Paper>