<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1006"> <Title>Automatic Acquisition of Hierarchical Transduction Models for Machine Translation</Title> <Section position="4" start_page="42" end_page="43" type="metho"> <SectionTitle> 3 Alignment </SectionTitle> <Paragraph position="0"> The first stage in the training process is obtaining, for each bitext, an alignment function f : W → V mapping word subsequences W in the source to word subsequences V in the target. In this process an alignment model is constructed which specifies a cost for each pairing (W, V) of source and target subsequences, and an alignment search is carried out to minimize the sum of the costs of a set of pairings which completely maps the bitext source to its target.</Paragraph> <Section position="1" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.1 Alignment model </SectionTitle> <Paragraph position="0"> The cost of a pairing is composed of a weighted combination of cost functions. We currently use two.</Paragraph> <Paragraph position="1"> The first cost function is the φ correlation measure (cf. the use of φ² in Gale and Church (1991)), computed as follows:</Paragraph> <Paragraph position="2"> φ = (N·n_WV - n_W·n_V) / √(n_W · n_V · (N - n_W) · (N - n_V))</Paragraph> <Paragraph position="3"> N is the total number of bitexts, n_V the number of bitexts in which V appears in the target, n_W the number of bitexts in which W appears in the source, and n_WV the number of bitexts in which W appears in the source and V appears in the target.</Paragraph> <Paragraph position="4"> We tried using the log probabilities of target subsequences given source subsequences (cf. Brown et al. (1990)) as a cost function instead of φ, but φ resulted in better performance of our translation models.</Paragraph> <Paragraph position="5"> The second cost function used is a distance measure which penalizes pairings in which the source subsequence and target subsequence are in very different positions in their respective sentences. 
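A minimal Python sketch of this alignment cost model, using the bitext counts defined in the text. The particular weighting and the use of 1 - φ as the correlation cost are our own assumptions, not details given in the paper:

```python
import math

def phi(n_wv, n_w, n_v, n):
    # Standard phi coefficient from bitext counts: n is the total number of
    # bitexts, n_w and n_v count bitexts containing W (source side) and V
    # (target side), and n_wv counts bitexts containing both.
    denom = math.sqrt(n_w * n_v * (n - n_w) * (n - n_v))
    if denom == 0:
        return 0.0
    return (n * n_wv - n_w * n_v) / denom

def pairing_cost(n_wv, n_w, n_v, n, src_pos, tgt_pos, weight=0.5):
    # Weighted combination of the two cost functions: high correlation
    # lowers the cost, and a large difference in sentence position raises
    # it. The weight and the cost scalings are illustrative assumptions.
    correlation_cost = 1.0 - phi(n_wv, n_w, n_v, n)
    distance_cost = abs(src_pos - tgt_pos)
    return weight * correlation_cost + (1.0 - weight) * distance_cost
```

Raising the distance weight biases the model towards more parallel alignments, as the following sentence in the text describes.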
Different weightings of the distance and correlation costs can be used to bias the model towards more or less parallel alignments for different language pairs.</Paragraph> </Section> <Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 3.2 Alignment search </SectionTitle> <Paragraph position="0"> The agenda-based alignment search makes use of dynamic programming to record the best cost seen for all partial alignments covering the same source and target subsequence; partial alignments coming off the agenda that have a higher cost for the same coverage are discarded and take no further part in the search. An effort limit on the number of agenda items processed is used to ensure reasonable speed in the search regardless of sentence length. An iterative broadening strategy is used, so that at breadth i only the i lowest cost pairings for each source subsequence are allowed in the search, with the result that most optimal alignments are found well before the effort limit is reached.</Paragraph> <Paragraph position="1"> In the experiment reported in Section 7, source and target subsequences of lengths 0, 1 and 2 were allowed in pairings.</Paragraph> </Section> </Section> <Section position="5" start_page="43" end_page="44" type="metho"> <SectionTitle> 4 Transducer construction </SectionTitle> <Paragraph position="0"> Building a head transducer involves creating appropriate head transducer states and tracing hypothesized head-transducer transitions between them that are consistent with the occurrence of the pairings (W, f(W)) in each aligned bitext. When a source sequence W in an alignment pairing consists of more than one word, the least frequent of these words in the training corpus is taken to be the primary word of the subsequence. 
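The choice of primary word just described (the least frequent word of a multi-word source subsequence) can be sketched as follows; the mapping from words to corpus frequencies is a hypothetical input:

```python
def primary_word(subsequence, corpus_counts):
    # The primary word of a multi-word source subsequence is taken to be
    # its least frequent word in the training corpus. 'corpus_counts' is
    # any mapping from word to training-corpus frequency; unseen words
    # default to frequency 0 and so win ties for "least frequent".
    return min(subsequence, key=lambda w: corpus_counts.get(w, 0))
```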
It is convenient to extend the domain of an alignment function f to include primary words w by setting f(w) = f(W).</Paragraph> <Paragraph position="1"> The main transitions that are traced in our construction are those that map the heads, wl and wr, of the left and right dependent phrases of w (see Figure 2) to their translations as indicated in the alignment. The positions of these dependents in the target string are computed by comparing the positions of f(wl) and f(wr) to the position of V = f(w). The actual states and transitions in the construction are specified below.</Paragraph> <Paragraph position="2"> Additional transitions are included for cases of compounding, i.e. those for which the source subsequence in an alignment function pairing consists of more than one word. Specifically, the source subsequence W may be a compound consisting of a primary word w together with a secondary word w'. There are no additional transitions for cases in which the target subsequence V = f(w) of an alignment function pairing has more than one word. For the purposes of the head-transduction model constructed, such compound target subsequences are effectively treated as single words (containing space characters). That is, we are constructing a transducer for (w : V).</Paragraph> <Paragraph position="3"> We use the notation Q(w : V) for states of the constructed head transducer. Here Q is an additional symbol, e.g. &quot;initial&quot;, for identifying a specific state of this transducer. A state such as initial(w : V) mentioned in the construction is first looked up in a table of states created so far in the training procedure, and created if necessary. A bar above a substring denotes the number of words preceding the substring in the source or target string.</Paragraph> <Paragraph position="4"> We give the construction for the case illustrated in Figure 2, i.e. one left dependent wl, one right dependent wr, and a single secondary word w' to the left of w. 
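The look-up-or-create treatment of states such as initial(w : V) can be sketched as follows; the table layout and the state representation are our own assumptions:

```python
def get_state(table, q, w, v):
    # A state Q(w : V) is identified by the symbol Q together with the
    # pairing (w : V). It is first looked up in the table of states created
    # so far in the training procedure, and created only if absent, so
    # repeated requests for the same state return the same object.
    key = (q, w, v)
    if key not in table:
        table[key] = {"name": q, "pairing": (w, v)}
    return table[key]
```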
Figure 3 shows the result as part of a finite state transition diagram. The other transition arrows shown in the diagram will arise from other bitext alignments containing (w : V) pairings. Other cases covered by our algorithm (e.g. a single left dependent but no right dependent) are simple variants.</Paragraph> <Paragraph position="6"> 1. Mark initial(w : V) as an initial state for the transducer.</Paragraph> <Paragraph position="7"> 2. Include a transition consuming the secondary word w' without any target output: (initial(w : V), leftw'(w : V), w', e, -1, 0), where e is the empty string.</Paragraph> <Paragraph position="8"> 3. Include a transition for mapping the source dependent wl to the target dependent f(wl): (leftw'(w : V), midwl(w : V), wl, f(wl), -1, βl), where βl = f(wl)¯ - V¯.</Paragraph> <Paragraph position="9"> 4. Include a transition for mapping the source dependent wr to the target dependent f(wr): (midwl(w : V), final(w : V), wr, f(wr), +1, βr), where βr = f(wr)¯ - V¯.</Paragraph> <Paragraph position="11"> 5. Mark final(w : V) as a final state for the transducer.</Paragraph> <Paragraph position="12"> The inclusion of transitions, and the marking of states as initial or final, are treated as event observation counts for a statistical head transduction model. More specifically, they are used as counts for maximum likelihood estimation of the transducer start, transition, and stop probabilities specified in Section 2.</Paragraph> </Section> <Section position="6" start_page="44" end_page="44" type="metho"> <SectionTitle> 5 Head selection </SectionTitle> <Paragraph position="0"> We have been using the following monolingual metrics, which can be applied to either the source or target language, to predict the likelihood of a word being the head word of a string. Distance: The distance between a dependent and its head. 
In general, the likelihood of a head-dependent relation decreases as distance increases (Collins, 1996).</Paragraph> <Paragraph position="1"> Word frequency: The frequency of occurrence of a word in the training corpus.</Paragraph> <Paragraph position="2"> Word 'complexity': For languages with phonetic orthography such as English, 'complexity' of a word can be measured in terms of the number of characters in that word.</Paragraph> <Paragraph position="3"> Optionality: This metric is intended to identify optional modifiers which are less likely to be heads. For each word we find trigrams with the word of interest as the middle word and compare the distribution of these trigrams with the distribution of the bigrams formed from the outer pairs of words. If these two distributions are strongly correlated then the word is highly optional.</Paragraph> <Paragraph position="4"> Each of the above metrics provides a score for the likelihood of a word being a head word. A weighted sum of these scores is used to produce a ranked list of head words given a string for use in step 2 of the training algorithm in Section 2. If the metrics are applied to the target language instead of the source, the ranking of a source word is taken from the ranking of the target word it is aligned with.</Paragraph> <Paragraph position="5"> In Section 7, we show the effectiveness of appropriate head selection in terms of the translation performance and size of the head transducer model in the context of an English-Spanish translation system.</Paragraph> </Section> <Section position="7" start_page="44" end_page="45" type="metho"> <SectionTitle> 6 Evaluation method </SectionTitle> <Paragraph position="0"> There is no agreed-upon measure of machine translation quality. 
For our current purposes we require a measure that is objective, reliable, and that can be calculated automatically.</Paragraph> <Paragraph position="1"> We use here the word accuracy measure of the string distance between a reference string and a result string, a measure standardly used in the automatic speech recognition (ASR) community. While for ASR the reference is a human transcription of the original speech and the result the output of the speech recognition process run on the original speech, we use the measure to compare two different translations of a given source, typically a human translation and a machine translation.</Paragraph> <Paragraph position="2"> The string distance metric is computed by first finding a transformation of one string into another that minimizes the total weight of substitutions, insertions and deletions. (We use the same weights for these operations as in the NIST ASR evaluation software (NIS, 1997).) If we write S for the resulting number of substitutions, I for insertions, D for deletions, and R for the number of words in the reference translation string, we can express the metric as follows: word accuracy = 1 - (D + S + I) / R. This measure has the merit of being completely automatic and non-subjective. However, taking any single translation as reference is unrealistically unfavourable, since there is a range of acceptable translations. To increase the reliability of the measure, therefore, we give each system translation the best score it receives against any of a number of independent human translations of the same source.</Paragraph> </Section> <Section position="8" start_page="45" end_page="45" type="metho"> <SectionTitle> 7 English-Spanish experiment </SectionTitle> <Paragraph position="0"> The training and test data for the experiments reported here were taken from a set of transcribed utterances from the air travel information system (ATIS) corpus together with a translation of each utterance to Spanish. 
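The word accuracy measure of Section 6 can be sketched as follows; this sketch uses unit weights for substitution, insertion and deletion, whereas the NIST software may weight the operations differently:

```python
def word_accuracy(reference, result):
    # Word accuracy = 1 - (D + S + I) / R, where D, S and I are the
    # deletions, substitutions and insertions in a minimum-cost
    # transformation of the result into the reference, and R is the number
    # of words in the reference. Computed by standard edit-distance
    # dynamic programming over word tokens.
    ref, hyp = reference.split(), result.split()
    r, h = len(ref), len(hyp)
    d = [[0] * (h + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        d[i][0] = i                      # all deletions
    for j in range(h + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, r + 1):
        for j in range(1, h + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return 1.0 - d[r][h] / r
```

Note that the measure can go negative when the result needs more edits than the reference has words.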
An utterance is typically a single sentence but is sometimes more than one sentence spoken in sequence. There were 14418 training utterances, a total of 140788 source words, corresponding to 167865 target words. This training set was used as input to alignment model construction; alignment search was carried out only on sentences up to length 15, a total of 11542 bitexts. Transduction training (including head ranking) was carried out on the 11327 alignments obtained.</Paragraph> <Paragraph position="1"> The test set used in the evaluations reported here consisted of 336 held-out English sentences.</Paragraph> <Paragraph position="2"> We obtained three separate human translations of this test set: trl was translated by the same translation bureau as the training data; tr2 was translated by a different translation bureau; crl was a correction of the output of the trained system by a professional translator.</Paragraph> <Paragraph position="3"> The models evaluated are sys: the automatically trained head transduction model; wfw: a baseline word-for-word model in which each English word is translated by the Spanish word most highly correlated with it in the corpus.</Paragraph> <Paragraph position="4"> Table 1 shows the word accuracy percentages (see Section 6) for the trained system sys and the word-for-word baseline wfw against trl (the original held-out translations) at various source sentence lengths. The trained system has word accuracy of 74.1% on sentences of all lengths; Table 1 also gives the accuracy on sentences up to length 15, the length on which the transduction model was trained. Table 2 shows the word accuracy percentages for the trained system sys and the word-for-word baseline wfw against any of the three reference translations trl, crl, and tr2. That is, for each output string the human translation closest to it is taken as the reference translation. 
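Taking the closest of several human references, as just described, can be sketched as follows. The stand-in scorer and the function names are our own; a real evaluation would use the edit-distance word accuracy of Section 6 as the scorer:

```python
def exact_accuracy(reference, result):
    # Minimal stand-in scorer: fraction of aligned positions where the
    # tokens agree. Purely illustrative; not the paper's metric.
    ref, hyp = reference.split(), result.split()
    matches = sum(1 for a, b in zip(ref, hyp) if a == b)
    return matches / max(len(ref), 1)

def best_reference_accuracy(result, references, scorer=exact_accuracy):
    # Each system translation is given the best score it receives against
    # any of the independent human translations (trl, crl and tr2 in the
    # experiment of Section 7).
    return max(scorer(ref, result) for ref in references)
```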
With this more accurate measure, the system's word accuracy is 78.5% on sentences of all lengths.</Paragraph> <Paragraph position="5"> Table 3 compares the performance of the translation system when head words are selected (a) at random (baseline), (b) with head selection in the source language, and (c) with head selection in the target language, i.e., selecting source heads that are aligned with the highest ranking target head words. The reference for word accuracy here is the single reference translation trl. Note that the 'In Target' head selection method is the one used in training translation model sys. The use of head selection metrics improves on random head selection in terms of translation accuracy and number of parameters. An interesting twist, however, is that applying the metrics to target strings performs better than applying the metrics to the source words directly.</Paragraph> </Section> </Paper>