<?xml version="1.0" standalone="yes"?> <Paper uid="J90-2002"> <Title>A STATISTICAL APPROACH TO MACHINE TRANSLATION</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 THE LANGUAGE MODEL </SectionTitle> <Paragraph position="0"> Given a word string, s_1 s_2 ... s_n, we can, without loss of generality, write</Paragraph> <Paragraph position="1"> Pr(s_1 s_2 ... s_n) = Pr(s_1) Pr(s_2 | s_1) ... Pr(s_n | s_1 s_2 ... s_{n-1}).</Paragraph> <Paragraph position="2"> Thus, we can recast the language modeling problem as one of computing the probability of a single word given all of the words that precede it in a sentence. At any point in the sentence, we must know the probability of an object word, s_i, given a history, s_1 s_2 ... s_{i-1}. Because there are so many histories, we cannot simply treat each of these probabilities as a separate parameter. One way to reduce the number of parameters is to place each of the histories into an equivalence class in some way and then to allow the probability of an object word to depend on the history only through the equivalence class into which that history falls. In an n-gram model, two histories are equivalent if they agree in their final n - 1 words. Thus, in a bigram model, two histories are equivalent if they end in the same word, and in a trigram model, two histories are equivalent if they end in the same two words.</Paragraph> <Paragraph position="3"> While n-gram models are linguistically simpleminded, they have proven quite valuable in speech recognition and have the redeeming feature that they are easy to make and to use. We can see the power of a trigram model by applying it to something that we call bag translation from English into English. In bag translation we take a sentence, cut it up into words, place the words in a bag, and then try to recover the sentence given the bag. We use the n-gram model to rank different arrangements of the words in the bag. Thus, we treat an arrangement S as better than another arrangement S' if Pr(S) is greater than Pr(S').</Paragraph> <Paragraph position="4"> We tried this scheme on a random sample of sentences. From a collection of 100 sentences, we considered the 38 sentences with fewer than 11 words each. We had to restrict the length of the sentences because the number of possible rearrangements grows exponentially with sentence length. We used a trigram language model that had been constructed for a speech recognition system. We were able to recover 24 (63%) of the sentences exactly. Sometimes, the sentence that we found to be most probable was not an exact reproduction of the original, but conveyed the same meaning. In other cases, of course, the most probable sentence according to our model was just garbage. If we count as correct all of the sentences that retained the meaning of the original, then 32 (84%) of the 38 were correct. Some examples of the original sentences and the sentences recovered from the bags are shown in Figure 2.</Paragraph> <Paragraph position="5"> We have no doubt that if we had been able to handle longer sentences, the results would have been worse, and that the probability of error grows rapidly with sentence length.</Paragraph> </Section>
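The bag-translation test lends itself to a short sketch. In the toy version below, `trigram_counts`, `bigram_counts`, `vocab_size`, and the add-one smoothing are stand-ins of ours, not details from the paper, which used a trigram model built for a speech recognition system.

```python
import itertools
import math

def trigram_logprob(words, trigram_counts, bigram_counts, vocab_size):
    # log Pr(S) as a sum of log Pr(s_i | s_{i-2} s_{i-1}) terms,
    # add-one smoothed so unseen trigrams do not zero out a sentence.
    padded = ["<s>", "<s>"] + list(words)
    logp = 0.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        seen = trigram_counts.get(history + (padded[i],), 0)
        total = bigram_counts.get(history, 0)
        logp += math.log((seen + 1) / (total + vocab_size))
    return logp

def bag_translate(bag, trigram_counts, bigram_counts, vocab_size):
    # Rank every arrangement of the bag by Pr(S) and keep the best one.
    return max(itertools.permutations(bag),
               key=lambda s: trigram_logprob(s, trigram_counts,
                                             bigram_counts, vocab_size))
```

Because `itertools.permutations` enumerates every arrangement, the cost grows factorially with the number of words, which is why the experiment was restricted to sentences of fewer than 11 words.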
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 THE TRANSLATION MODEL </SectionTitle> <Paragraph position="0"> For simple sentences, it is reasonable to think of the French translation of an English sentence as being generated from the English sentence word by word. Thus, in the sentence pair (Jean aime Marie | John loves Mary) we feel that John produces Jean, loves produces aime, and Mary produces Marie.</Paragraph> <Paragraph position="1"> We say that a word is aligned with the word that it produces. Thus John is aligned with Jean in the pair that we just discussed. Of course, not all pairs of sentences are as simple as this example. In the pair (Jean n'aime personne | John loves nobody), we can again align John with Jean and loves with aime, but now, nobody aligns with both n' and personne. Sometimes, words in the English sentence of the pair align with nothing in the French sentence, and similarly, occasionally words in the French member of the pair do not appear to go with any of the words in the English sentence. We refer to a picture such as that shown in Figure 3 as an alignment. An alignment indicates the origin in the English sentence of each of the words in the French sentence. We call the number of French words that an English word produces in a given alignment its fertility in that alignment.</Paragraph> <Paragraph position="2"> If we look at a number of pairs, we find that words near the beginning of the English sentence tend to align with words near the beginning of the French sentence and that words near the end of the English sentence tend to align with words near the end of the French sentence. But this is not always the case. Sometimes, a French word will appear quite far from the English word that produced it. We call this effect distortion. Distortions will, for example, allow adjectives to precede the nouns that they modify in English but to follow them in French.</Paragraph> <Paragraph position="3"> It is convenient to introduce the following notation for alignments. We write the French sentence followed by the English sentence and enclose the pair in parentheses. We separate the two by a vertical bar. Following each of the English words, we give a parenthesized list of the positions of the words in the French sentence with which it is aligned. If an English word is aligned with no French words, then we omit the list. Thus (Jean aime Marie | John(1) loves(2) Mary(3)) is the simple alignment with which we began this discussion. In the alignment (Le chien est battu par Jean | John(6) does beat(3,4) the(1) dog(2)), John produces Jean, does produces nothing, beat produces est battu, the produces Le, dog produces chien, and par is not produced by any of the English words.</Paragraph> <Paragraph position="4"> [Figure 3: an alignment of the English sentence "The proposal will not now be implemented" with its French translation "Les propositions ne seront pas mises en application maintenant."]</Paragraph> <Paragraph position="5"> Rather than describe our translation model formally, we present it by working an example. To compute the probability of the alignment (Le chien est battu par Jean | John(6) does beat(3,4) the(1) dog(2)), begin by multiplying the probability that John has fertility 1 by Pr(Jean | John). Then multiply by the probability that does has fertility 0. Next, multiply by the probability that beat has fertility 2 times Pr(est | beat) Pr(battu | beat), and so on. The word par is produced from a special English word which is denoted by (null). The result is</Paragraph> <Paragraph position="6"> Pr(1 | John) Pr(Jean | John) × Pr(0 | does) × Pr(2 | beat) Pr(est | beat) Pr(battu | beat) × Pr(1 | the) Pr(Le | the) × Pr(1 | dog) Pr(chien | dog) × Pr(1 | (null)) Pr(par | (null)).</Paragraph> <Paragraph position="7"> Finally, factor in the distortion probabilities. Our model for distortions is, at present, very simple. We assume that the position of the target word depends only on the length of the target sentence and the position of the source word. Therefore, a distortion probability has the form Pr(i | j, l), where i is a target position, j a source position, and l the target length.</Paragraph>
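The worked example can be restated as a short routine. This is a minimal sketch, assuming dictionary-of-dictionaries parameter tables of our own devising; it also simplifies the treatment of (null), charging each unaligned French word only a translation probability, with no fertility or distortion terms for (null).

```python
def alignment_prob(french, english, alignment, fertility, translation, distortion):
    # For each English word: one fertility term, then one translation
    # term and one distortion term per French word it produces. French
    # words produced by no English word are charged to "<null>".
    l = len(french)
    p = 1.0
    produced = set()
    for j, e in enumerate(english, start=1):
        targets = alignment.get(j, [])
        p *= fertility[e].get(len(targets), 0.0)          # Pr(n | e)
        for i in targets:
            p *= translation[e].get(french[i - 1], 0.0)   # Pr(f | e)
            p *= distortion.get((i, j, l), 0.0)           # Pr(i | j, l)
            produced.add(i)
    for i in range(1, l + 1):
        if i not in produced:                             # e.g. par above
            p *= translation["<null>"].get(french[i - 1], 0.0)
    return p

# The example from the text: (Le chien est battu par Jean | John(6)
# does beat(3,4) the(1) dog(2)); par falls to <null>.
french = ["Le", "chien", "est", "battu", "par", "Jean"]
english = ["John", "does", "beat", "the", "dog"]
alignment = {1: [6], 2: [], 3: [3, 4], 4: [1], 5: [2]}
# p = alignment_prob(french, english, alignment, fertility, translation, distortion)
```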
<Paragraph position="8"> In summary, the parameters of our translation model are a set of fertility probabilities Pr(n | e) for each English word e and for each fertility n from 0 to some moderate limit, in our case 25; a set of translation probabilities Pr(f | e), one for each element f of the French vocabulary and each member e of the English vocabulary; and a set of distortion probabilities Pr(i | j, l) for each target position i, source position j, and target length l. We limit i, j, and l to the range 1 to 25.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 SEARCHING </SectionTitle> <Paragraph position="0"> In searching for the sentence S that maximizes Pr(S)Pr(T | S), we face the difficulty that there are simply too many sentences to try. Instead, we must carry out a suboptimal search. We do so using a variant of the stack search that has worked so well in speech recognition (Bahl et al. 1983). In a stack search, we maintain a list of partial alignment hypotheses. Initially, this list contains only one entry corresponding to the hypothesis that the target sentence arose in some way from a sequence of source words that we do not know. In the alignment notation introduced earlier, this entry might be (Jean aime Marie | *), where the asterisk is a placeholder for an unknown sequence of source words. The search proceeds by iterations, each of which extends some of the most promising entries on the list. An entry is extended by adding one or more additional words to its hypothesis; for example, we might extend the initial entry above by hypothesizing a source word for one of the target words, as in (Jean aime Marie | John(1) *). The search ends when there is a complete alignment on the list that is significantly more promising than any of the incomplete alignments.</Paragraph> <Paragraph position="1"> Sometimes, the sentence S' that is found in this way is not the same as the sentence S that a translator might have been working on. When S' itself is not an acceptable translation, then there is clearly a problem. If Pr(S')Pr(T | S') is greater than Pr(S)Pr(T | S), then the problem lies in our modeling of the language or of the translation process. If, however, Pr(S')Pr(T | S') is less than Pr(S)Pr(T | S), then our search has failed to find the most likely sentence. We call this latter type of failure a search error. In the case of a search error, we can be sure that our search procedure has failed to find the most probable source sentence, but we cannot be sure that were we to correct the search we would also correct the error. We might simply find an even more probable sentence that nonetheless is incorrect. Thus, while a search error is a clear indictment of the search procedure, it is not an acquittal of either the language model or the translation model.</Paragraph> </Section>
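A rough sketch of such a stack search follows, with `extend`, `score`, and `is_complete` left abstract since the paper does not spell them out. Here `score` stands for the combined Pr(S)Pr(T | S) of a partial hypothesis, and the fixed `beam_width` and the simple stopping test are our assumptions in place of "significantly more promising".

```python
def stack_search(initial_entries, extend, score, is_complete, beam_width=25):
    # Maintain a list of partial alignment hypotheses; on each iteration,
    # extend the most promising entries by one or more source words, and
    # stop once a complete hypothesis beats every surviving partial one.
    stack = list(initial_entries)
    best_complete = None
    while stack:
        stack.sort(key=score, reverse=True)
        frontier, stack = stack[:beam_width], stack[beam_width:]
        for entry in frontier:
            for new_entry in extend(entry):      # add one or more words
                if is_complete(new_entry):
                    if best_complete is None or score(new_entry) > score(best_complete):
                        best_complete = new_entry
                else:
                    stack.append(new_entry)
        if best_complete is not None and \
           all(score(best_complete) >= score(e) for e in stack):
            return best_complete
    return best_complete
```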
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 PARAMETER ESTIMATION </SectionTitle> <Paragraph position="0"> Both the language model and the translation model have many parameters that must be specified. To estimate these parameters accurately, we need a large quantity of data. For the parameters of the language model, we need only English text, which is available in computer-readable form from many sources; but for the parameters of the translation model, we need pairs of sentences that are translations of one another.</Paragraph> <Paragraph position="1"> By law, the proceedings of the Canadian parliament are kept in both French and English. As members rise to address a question before the house or otherwise express themselves, their remarks are jotted down in whichever of the two languages is used. After the meeting adjourns, a collection of translators begins working to produce a complete set of the proceedings in both French and English. These proceedings are called Hansards, in remembrance of the publisher of the proceedings of the British parliament in the early 1800s. All of these proceedings are available in computer-readable form, and we have been able to obtain about 100 million words of English text and the corresponding French text from the Canadian government. Although the translations are not made sentence by sentence, we have been able to extract about three million pairs of sentences by using a statistical algorithm based on sentence length. Approximately 99% of these pairs are made up of sentences that are actually translations of one another. It is this collection of sentence pairs, or more properly various subsets of this collection, from which we have estimated the parameters of the language and translation models.</Paragraph> <Paragraph position="2"> In the experiments we describe later, we use a bigram language model. Thus, we have one parameter for every pair of words in the source language. We estimate these parameters from the counts of word pairs in a large sample of text from the English part of our Hansard data using a method described by Jelinek and Mercer (1980).</Paragraph> <Paragraph position="3"> In Section 3 we discussed alignments of sentence pairs. If we had a collection of aligned pairs of sentences, then we could estimate the parameters of the translation model by counting, just as we do for the language model. However, we do not have alignments but only the unaligned pairs of sentences. This is exactly analogous to the situation in speech recognition, where one has the script of a sentence and the time waveform corresponding to an utterance of it, but no indication of just what in the time waveform corresponds to what in the script. In speech recognition, this problem is attacked with the EM algorithm (Baum 1972; Dempster et al. 1977). We have adapted this algorithm to our problem in translation. In brief, it works like this: given some initial estimate of the parameters, we can compute the probability of any particular alignment. We can then re-estimate the parameters by weighting each possible alignment according to its probability as determined by the initial guess of the parameters. Repeated iterations of this process lead to parameters that assign ever greater probability to the set of sentence pairs that we actually observe. This algorithm leads to a local maximum of the probability of the observed pairs as a function of the parameters of the model. There may be many such local maxima. The particular one at which we arrive will, in general, depend on the initial choice of parameters.</Paragraph> </Section>
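As a concrete, much-simplified sketch of one such EM pass: the version below re-estimates translation probabilities only, ignoring fertilities and distortions, so it behaves like the simplest model in this family rather than the full model of Section 3. Linking each French word to every English word of its pair in proportion to the current parameters is what weighting every alignment by its probability amounts to in this reduced setting.

```python
from collections import defaultdict

def em_iteration(sentence_pairs, t):
    # One EM pass over (french_words, english_words) pairs, assuming
    # t[(f, e)] = Pr(f | e) is defined (e.g. uniformly) for every pair
    # of vocabulary words. Returns the re-estimated table.
    counts = defaultdict(float)
    totals = defaultdict(float)
    for french, english in sentence_pairs:
        for f in french:
            norm = sum(t[(f, e)] for e in english)
            for e in english:
                w = t[(f, e)] / norm      # posterior weight of the link f-e
                counts[(f, e)] += w       # expected count of e producing f
                totals[e] += w
    return {fe: c / totals[fe[1]] for fe, c in counts.items()}
```

Repeated calls raise the probability of the observed pairs until a local maximum is reached, mirroring the behavior described above.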
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 TWO PILOT EXPERIMENTS </SectionTitle> <Paragraph position="0"> In our first experiment, we test our ability to estimate parameters for the translation model. We chose as our English vocabulary the 9,000 most common words in the English part of the Hansard data, and as our French vocabulary the 9,000 most common French words. For the purposes of this experiment, we replaced all other words with either the unknown English word or the unknown French word, as appropriate. We applied the iterative algorithm discussed above in order to estimate some 81 million parameters from 40,000 pairs of sentences comprising a total of about 800,000 words in each language. The algorithm requires an initial guess of the parameters. We assumed that each of the 9,000 French words was equally probable as a translation of any of the 9,000 English words; we assumed that each of the fertilities from 0 to 25 was equally probable for each of the 9,000 English words; and finally, we assumed that each target position was equally probable given each source position and target length. Thus, our initial choices contained very little information about either French or English.</Paragraph>
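The flat starting point described above is straightforward to write down. The tables here are our own toy encoding of that initialization; run it only on small vocabularies, since at 9,000 words apiece the translation table alone would hold the 81 million entries mentioned above.

```python
def uniform_init(english_vocab, french_vocab, max_fertility=25, max_length=25):
    # Every French word equally probable given every English word.
    t = {(f, e): 1.0 / len(french_vocab)
         for e in english_vocab for f in french_vocab}
    # Every fertility from 0 to 25 equally probable for every English word.
    fertility = {(n, e): 1.0 / (max_fertility + 1)
                 for e in english_vocab for n in range(max_fertility + 1)}
    # Every target position equally probable given source position and
    # target length: Pr(i | j, l) = 1 / l.
    distortion = {(i, j, l): 1.0 / l
                  for l in range(1, max_length + 1)
                  for i in range(1, l + 1)
                  for j in range(1, max_length + 1)}
    return t, fertility, distortion
```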
<Paragraph position="1"> Figure 4 shows the translation and fertility probabilities we estimated for the English word the. We see that, according to the model, the translates most frequently into the French articles le and la. This is not surprising, of course, but we emphasize that it is determined completely automatically by the estimation process. In some sense, this correspondence is inherent in the sentence pairs themselves.</Paragraph> <Paragraph position="2"> Figure 5 shows these probabilities for the English word not. As expected, the French word pas appears as a highly probable translation. Also, the fertility probabilities indicate that not translates most often into two French words, a situation consistent with the fact that negative French sentences contain the auxiliary word ne in addition to a primary negative word such as pas or rien.</Paragraph> <Paragraph position="3"> For both of these words, we could easily have discovered the same information from a dictionary. In Figure 6, we see the trained parameters for the English word hear. As we would expect, various forms of the French word entendre appear as possible translations, but the most probable translation is the French word bravo. When we look at the fertilities here, we see that the probability is about equally divided between fertility 0 and fertility 1. The reason for this is that the English-speaking members of parliament express their approval by shouting Hear, hear!, while the French-speaking ones say Bravo! The translation model has learned that usually two hears produce one bravo by having one of them produce the bravo and the other produce nothing.</Paragraph> <Paragraph position="4"> A given pair of sentences has many possible alignments, since each target word can be aligned with any source word. A translation model will assign significant probability only to some of the possible alignments, and we can gain further insight about the model by examining the alignments that it considers most probable. We show one such alignment in Figure 3. Observe that, quite reasonably, not is aligned with ne and pas, while implemented is aligned with the phrase mises en application. We can also see here a deficiency of the model: intuitively, we feel that will and be act in concert to produce seront, yet the model aligns will with seront and aligns be with nothing.</Paragraph> <Paragraph position="5"> In our second experiment, we used the statistical approach to translate from French to English. To have a manageable task, we limited the English vocabulary to the 1,000 most frequently used words in the English part of the Hansard corpus. We chose the French vocabulary to be the 1,700 most frequently used French words in translations of sentences that were completely covered by the 1,000-word English vocabulary. We estimated the 17 million parameters of the translation model from 117,000 pairs of sentences that were completely covered by both our French and English vocabularies. We estimated the parameters of the bigram language model from 570,000 sentences from the English part of the Hansard data. These sentences contain about 12 million words altogether and are not restricted to sentences completely covered by our vocabulary.</Paragraph> <Paragraph position="6"> We used our search procedure to decode 73 new French sentences from elsewhere in the Hansard data. We assigned each of the resulting sentences a category according to the following criteria. If the decoded sentence was exactly the same as the actual Hansard translation, we assigned the sentence to the exact category. If it conveyed the same meaning as the Hansard translation but in slightly different words, we assigned it to the alternate category. If the decoded sentence was a legitimate translation of the French sentence but did not convey the same meaning as the Hansard translation, we assigned it to the different category. If it made sense as an English sentence but could not be interpreted as a translation of the French sentence, we assigned it to the wrong category. Finally, if the decoded sentence was grammatically deficient, we assigned it to the ungrammatical category. An example from each category is shown in Figure 7, and our decoding results are summarized in Figure 8.</Paragraph> <Paragraph position="7"> Only 5% of the sentences fell into the exact category. However, we feel that a decoded sentence that is in any of the first three categories (exact, alternate, or different) represents a reasonable translation. By this criterion, the system performed successfully 48% of the time.</Paragraph> <Paragraph position="8"> As an alternate measure of the system's performance, one of us corrected each of the sentences in the last three categories (different, wrong, and ungrammatical) to either the exact or the alternate category. Counting one stroke for each letter that must be deleted and one stroke for each letter that must be inserted, 776 strokes were needed to repair all of the decoded sentences. This compares with the 1,916 strokes required to generate all of the Hansard translations from scratch. Thus, to the extent that translation time can be equated with key strokes, the system reduces the work by about 60%.</Paragraph>
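Counting one stroke per deleted letter and one per inserted letter is the deletion-insertion edit distance. The sketch below computes it against a fixed target string, which is a simplification: in the experiment the sentences were corrected to an acceptable translation, not necessarily to the Hansard rendering.

```python
def strokes(decoded, reference):
    # Minimum deletions plus insertions to turn `decoded` into
    # `reference`: len(decoded) + len(reference) minus twice the
    # length of their longest common subsequence.
    m, n = len(decoded), len(reference)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if decoded[i] == reference[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    common = lcs[m][n]
    return (m - common) + (n - common)

# Example: repairing the ungrammatical decoding from Figure 7.
print(strokes("You need of the whole benefits available.",
              "You need all the help you can get."))
```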
<Paragraph position="9"> [Figure 7. One example per category, given as French sentence / Hansard translation / decoded translation. Exact: Hansard and decoded translations both read "These amendments are certainly necessary." Alternate: "C'est pourtant très simple." / "Yet it is very simple." / "It is still very simple." Different: "J'ai reçu cette demande en effet." / "Such a request was made." / "I have received this request in effect." Wrong: "Permettez que je donne un exemple à la Chambre." / "Let me give the House one example." / "Let me give an example in the House." Ungrammatical: "Vous avez besoin de toute l'aide disponible." / "You need all the help you can get." / "You need of the whole benefits available."]</Paragraph> </Section> </Paper>