<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1021"> <Title>Anoop Sarkar</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.1 Baseline MT System: Alignment Templates </SectionTitle> <Paragraph position="0"> Our baseline MT system is the alignment template system described in detail by Och, Tillmann, and Ney (1999) and Och and Ney (2004). In the following, we give a short description of this baseline model.</Paragraph> <Paragraph position="1"> The probability model of the alignment template system for translating a sentence can be thought of in distinct stages. First, the source sentence words f_1^J are grouped into phrases ~f_1^K. For each phrase ~f an alignment template z is chosen, and the sequence of chosen alignment templates is reordered (according to pi_1^K). Then, every phrase ~f produces its translation ~e (using the corresponding alignment template z). Finally, the sequence of phrases ~e_1^K constitutes the sequence of words e_1^I.</Paragraph> <Paragraph position="2"> Our baseline system incorporated the following feature functions: Alignment Template Selection Each alignment template is chosen with probability p(z|~f), estimated by relative frequency. The corresponding feature function in our log-linear model is the log of the product of p(z|~f) over all alignment templates used.</Paragraph> <Paragraph position="3"> Word Selection This feature is based on the lexical translation probabilities p(e|f), estimated using relative frequencies according to the highest-probability word-level alignment for each training sentence. A translation probability conditioned on the source and target position within the alignment template, p(e|f,i,j), is interpolated with the position-independent probability p(e|f).</Paragraph> <Paragraph position="4"> Phrase Alignment This feature favors monotonic alignment at the phrase level. It measures the 'amount of non-monotonicity' by summing over the distance (in the source language) of alignment templates which are consecutive in the target language.</Paragraph> <Paragraph position="5"> Language Model Features As a language model feature, we use a standard backing-off word-based trigram language model (Ney, Generet, and Wessel, 1995). The baseline system actually includes four different language model features trained on four different corpora: the news part of the bilingual training data, a large Xinhua news corpus, a large AFP news corpus, and a set of Chinese news texts downloaded from the web.</Paragraph> <Paragraph position="6"> Word/Phrase Penalty The word penalty feature counts the length in words of the target sentence. Without this feature, the sentences produced tend to be too short. The phrase penalty feature counts the number of phrases produced, and can allow the model to prefer either short or long phrases.</Paragraph> <Paragraph position="7"> Phrases from Conventional Lexicon The baseline alignment template system makes use of the Chinese-English lexicon provided by LDC. Each lexicon entry is a potential phrase translation pair in the alignment template system. To score the use of these lexicon entries (which have no normal translation probability), this feature function counts the number of times such a lexicon entry is used.</Paragraph> <Paragraph position="8"> Additional Features A major advantage of the log-linear modeling approach is that it is easy to add new features. 
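To make the shape of these baseline feature functions concrete, the following is a minimal sketch in Python (the data structures, names, and the exact jump-distance definition are illustrative assumptions, not the system's actual implementation) of a few of the simpler features for a single translation hypothesis:

```python
import math

def word_penalty(target_words):
    # Length of the target sentence in words.
    return len(target_words)

def phrase_penalty(phrase_pairs):
    # Number of phrases (alignment templates) used to produce the hypothesis.
    return len(phrase_pairs)

def phrase_alignment(source_start_positions):
    # 'Amount of non-monotonicity': sum of source-side jump distances between
    # alignment templates that are consecutive in the target language.
    # One plausible reading of the distance; the paper does not spell it out.
    return sum(abs(curr - prev)
               for prev, curr in zip(source_start_positions,
                                     source_start_positions[1:]))

def alignment_template_selection(template_probs):
    # Log of the product of p(z | ~f) over all alignment templates used.
    return sum(math.log(p) for p in template_probs)
```
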
In this paper, we explore a variety of features based on successively deeper syntactic representations of the source and target sentences, and their alignment. For each of the new features discussed below, we added the feature value to the set of baseline features, re-estimated feature weights on development data, and obtained results on test data.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Framework </SectionTitle> <Paragraph position="0"> We worked with the Chinese-English data from the recent evaluations, as both large amounts of sentence-aligned training corpora and multiple gold-standard reference translations are available. This is a standard data set, making it possible to compare results with other systems.</Paragraph> <Paragraph position="1"> In addition, working on Chinese allows us to use the existing Chinese syntactic treebank and parsers based on it.</Paragraph> <Paragraph position="2"> For the baseline MT system, we distinguish the following three different sentence- or chunk-aligned parallel training corpora: * training corpus (train): This is the basic training corpus used to train the alignment template translation model (word lexicon and phrase lexicon). This corpus consists of about 170M English words. Large parts of this corpus are aligned on a sub-sentence level to avoid very long sentences, which would otherwise have to be filtered out to keep word alignment training manageable.</Paragraph> <Paragraph position="3"> * development corpus (dev): This is the training corpus used in discriminative training of the model parameters of the log-linear translation model. In most experiments described in this paper, this corpus consists of 993 sentences (about 25K words) in both languages.</Paragraph> <Paragraph position="4"> * test corpus (test): This is the test corpus used to assess the quality of the newly developed feature functions. It consists of 878 sentences (about 25K words).</Paragraph> <Paragraph position="5"> For development and test data, we have four English (reference) translations for each Chinese sentence.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Reranking, n-best lists, and oracles </SectionTitle> <Paragraph position="0"> For each sentence in the development, test, and blind test corpora, a set of 16,384 alternative translations was produced using the baseline system. For extracting the n-best candidate translations, an A* search is used. These n-best candidate translations are the basis for discriminative training of the model parameters and for reranking.</Paragraph> <Paragraph position="1"> We used n-best reranking rather than implementing new search algorithms. The development of efficient search algorithms for long-range dependencies is very complicated and a research topic in itself. The reranking strategy enabled us to quickly try out many new dependencies, which would not have been possible if the search algorithm had to be changed for each new dependency. On the other hand, the use of n-best list rescoring limits the possible improvement to what is available in the n-best list. Hence, it is important to analyze the quality of the n-best lists by determining how much of an improvement would be possible given a perfect reranking algorithm. 
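Before turning to that oracle analysis, the reranking step itself can be summarized in a short sketch (illustrative Python; the candidate representation and feature interface are assumptions, not the actual workshop code): each candidate is scored as a weighted sum of feature-function values, with weights tuned on the development corpus, and the highest-scoring candidate is returned.

```python
def rerank(nbest_candidates, feature_functions, weights):
    """Return the candidate with the highest log-linear score.

    nbest_candidates: candidate translations for one source sentence, each
    carrying whatever information the features need (e.g. alignments).
    feature_functions: callables h_m mapping a candidate to a real value.
    weights: corresponding weights lambda_m, tuned on the dev corpus.
    """
    def score(candidate):
        return sum(w * h(candidate)
                   for w, h in zip(weights, feature_functions))
    return max(nbest_candidates, key=score)
```
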
We computed the oracle translations, that is, the set of translations from our n-best list that yields the best BLEU score.[1] We use the following two methods to compute the BLEU score of an oracle translation: 1. optimal oracle (opt): We select the oracle sentences which give the highest BLEU score compared to the set of 4 reference translations. Then, we compute the BLEU score of the oracle sentences using the same set of reference translations.</Paragraph> <Paragraph position="2"> 2. round-robin oracle (rr): We select four different sets of oracle sentences, each set giving the highest BLEU score compared to one of the 4 reference translations. Then, for each set of oracle sentences we compute a BLEU score, always scoring with the three references that were not used to select the oracle. Finally, these four 3-reference BLEU scores are averaged.</Paragraph> <Paragraph position="3"> [1] Note that due to the corpus-level, holistic nature of the BLEU score, it is not trivial to compute the optimal set of oracle translations. We use a greedy search algorithm for the oracle translations that might find only a local optimum. Empirically, we do not observe a dependence on the starting point; hence we believe that this does not pose a significant problem. [Table 1 caption fragment: oracle scores as a function of the size of the n-best list. The avBLEUr3 scores are computed with respect to three reference translations, averaged over the four different choices of holding out one reference.]</Paragraph> <Paragraph position="5"> Table 1 shows what BLEU score can be obtained by rescoring a given n-best list. Using the first method with a 1000-best list, we obtain oracle translations that outperform the BLEU score of the human translations. The oracle translations achieve 113% of the human BLEU score on the test data (Table 1), while the first-best translations obtain 79.2% of the human BLEU score. The second method uses different references for selection and scoring. Here, using a 1000-best list, we obtain oracle translations with a relative human BLEU score of 88.5%.</Paragraph> <Paragraph position="6"> Based on the results of the oracle experiment, and in order to make rescoring computationally feasible for features requiring significant computation for each hypothesis, we used the top 1000 translation candidates for our experiments. The baseline system's BLEU score is 31.6% on the test set (equivalent to the 1-best oracle in Table 1). This is the benchmark against which the contributions of the additional features described in the remainder of this paper are to be judged.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Preprocessing </SectionTitle> <Paragraph position="0"> As a precursor to developing the various syntactic features described in this paper, the syntactic representations on which they are based needed to be computed. This involved part-of-speech tagging, chunking, and parsing both the Chinese and English sides of our training, development, and test sets.</Paragraph> <Paragraph position="1"> Applying the part-of-speech tagger to the often ungrammatical MT output from our n-best lists sometimes led to unexpected results. Often the tagger tries to &quot;fix up&quot; ungrammatical sentences, for example by looking for a verb when none is present:
China NNP 14 CD open JJ border NN cities NNS achievements VBZ remarkable JJ
Here, although achievements has never been seen as a verb in the tagger's training data, the prior for a verb in this position is high enough to cause a present-tense verb tag to be produced. 
In addition to the inaccuracies of the MT system, the difference in genre from the tagger's training text can cause problems. For example, while our MT data include news article headlines with no verb, headlines are not included in the Wall Street Journal text on which the tagger is trained. Similarly, the tagger is trained on full sentences with normalized punctuation, leading it to expect punctuation at the end of every sentence and to produce a punctuation tag even when the evidence does not support it:
China NNP 's POS economic JJ development NN and CC opening VBG up RP 14 CD border NN cities NNS remarkable JJ achievements .</Paragraph> <Paragraph position="2"> The same issues affect the parser. For example, the parser can create verb phrases where none exist, even in cases where the tagger correctly did not identify a verb in the sentence. These effects have serious implications for designing syntactic feature functions. Features such as &quot;is there a verb phrase&quot; may not do what you expect. One solution would be features that involve the probability of a parse subtree or tag sequence, allowing us to ask &quot;how good a verb phrase is it?&quot; Another solution is more detailed features examining more of the structure, such as &quot;is there a verb phrase with a verb?&quot;</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Word-Level Feature Functions </SectionTitle> <Paragraph position="0"> These features, directly based on the source and target strings of words, are intended to address such problems as translation choice, missing content words, and incorrect punctuation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Model 1 Score </SectionTitle> <Paragraph position="0"> We used IBM Model 1 (Brown et al., 1993) as one of the feature functions. Since Model 1 is a bag-of-words translation model that sums over all possible alignment probabilities, a lexical co-occurrence effect, or triggering effect, is expected. This captures a sort of topic or semantic coherence in translations.</Paragraph> <Paragraph position="1"> As defined by Brown et al. (1993), Model 1 gives a probability for any given translation pair: p(f|e; M1) = 1/(I+1)^J * prod_{j=1..J} sum_{i=0..I} t(f_j|e_i).</Paragraph> <Paragraph position="4"> We used GIZA++ to train the model. The training data is a subset (30 million words on the English side) of the entire corpus that was used to train the baseline MT system. For a missing translation word pair or unknown words, where t(f_j|e_i) = 0 according to the model, a constant t(f_j|e_i) = 10^-40 was used as a smoothing value.</Paragraph> <Paragraph position="5"> The average %BLEU score (average of the best four among 20 different search initial points) is 32.5. We also tried p(e|f; M1) as a feature function, but did not obtain improvements, which might be due to an overlap with the word selection feature in the baseline system.</Paragraph> <Paragraph position="6"> The Model 1 score is one of the best performing features. It seems to 'fix' the tendency of our baseline system to delete content words, and it improves word selection coherence through the triggering effect. It is also possible that the triggering effect helps in selecting a proper verb-noun combination, or a verb-preposition combination. 
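As a concrete reference, here is a minimal sketch of the Model 1 feature score (illustrative Python; the translation-table representation and the NULL-word handling are assumptions, not GIZA++'s interface):

```python
import math

SMOOTH = 1e-40  # floor used for unseen translation pairs, as in the text

def model1_log_prob(source_words, target_words, t_table):
    """log p(f | e; M1): bag-of-words score summing over all alignments.

    t_table: dict mapping (f, e) word pairs to t(f|e); the English side
    implicitly includes the NULL word at position 0.
    """
    english = ["NULL"] + list(target_words)
    log_p = -len(source_words) * math.log(len(english))  # the 1/(I+1)^J term
    for f in source_words:
        total = sum(t_table.get((f, e), SMOOTH) for e in english)
        log_p += math.log(total)
    return log_p
```
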
</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Lexical Re-ordering of Alignment Templates </SectionTitle> <Paragraph position="0"> As shown in Figure 1, the alignment templates (ATs) used in the baseline system can appear in various configurations, which we will call left/right-monotone and left/right-continuous. We built two of these four models to distinguish two types of lexicalized re-ordering of these ATs: The left-monotone model computes the total probability of all ATs being left-monotone, where the lower left corner of the AT touches the upper right corner of the previous AT. Note that the first word in the current AT may or may not immediately follow the last word in the previous AT. The total probability is the product over all alignment templates i of either P(AT_i is left-monotone) or 1 - P(AT_i is left-monotone).</Paragraph> <Paragraph position="1"> The right-continuous model computes the total probability of all ATs being right-continuous, where the lower left corner of the AT touches the upper right corner of the previous AT and the first word in the current AT immediately follows the last word in the previous AT. The total probability is the product over all alignment templates i of either P(AT_i is right-continuous) or 1 - P(AT_i is right-continuous).</Paragraph> <Paragraph position="2"> In both models, the probabilities P have been estimated from the full training data (train).</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Shallow Syntactic Feature Functions </SectionTitle> <Paragraph position="0"> By shallow syntax, we mean the output of the part-of-speech tagger and chunkers. We hope that such features can combine the strengths of tag- and chunk-based translation systems (Schafer and Yarowsky, 2003) with our baseline system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Projected POS Language Model </SectionTitle> <Paragraph position="0"> This feature uses Chinese POS tag sequences as surrogates for Chinese words to model movement. Chinese words are too sparse to model movement directly, but an attempt to model movement using Chinese POS tags may be more successful. We hope that this feature will compensate for a weak model of word movement in the baseline system.</Paragraph> <Paragraph position="1"> Chinese POS sequences are projected to English using the word alignment, and relative positions are indicated for each Chinese tag. (The feature function was also tried without the relative positions.) The following shows an example tagging of an English hypothesis, indicating how it was generated from the Chinese sentence:
CD +0  M +1  NN +3  NN -1  NN +2  NN +3
14 (measure) open border cities
The feature function is the log probability output by a trigram language model over this sequence. This is similar to the HMM alignment model (Vogel, Ney, and Tillmann, 1996), but in this case movement is calculated on the basis of parts of speech.</Paragraph> <Paragraph position="2"> The Projected POS feature function was one of the strongest-performing shallow syntactic feature functions, with a %BLEU score of 31.8. 
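A rough sketch of this feature follows (illustrative Python; the exact definition of the relative-position offsets is not fully specified above, so the jump from the previously aligned Chinese position is used here as one plausible reading, and the trigram scorer is an assumed callable over such projected sequences):

```python
def projected_pos_sequence(en_to_zh_alignment, zh_pos_tags):
    """Build the projected POS-plus-offset sequence for an English hypothesis.

    en_to_zh_alignment: for each English position j, the aligned Chinese
    position a_j (None if unaligned).  zh_pos_tags: POS tag per Chinese word.
    """
    seq, prev = [], None
    for a_j in en_to_zh_alignment:
        if a_j is None:
            continue
        offset = 0 if prev is None else a_j - prev
        seq.append(f"{zh_pos_tags[a_j]}{offset:+d}")
        prev = a_j
    return seq

def projected_pos_feature(seq, trigram_logprob):
    # trigram_logprob is an assumed callable returning the log probability of
    # a symbol sequence under a trigram language model trained on such
    # projected sequences.
    return trigram_logprob(seq)
```
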
This feature function can be thought of as a trade-off between purely word-based models and full generative models based upon shallow syntax.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Tree-Based Feature Functions </SectionTitle> <Paragraph position="0"> Syntax-based MT has shown promise in the work of, among others, Wu and Wong (1998) and Alshawi, Bangalore, and Douglas (2000). We hope that adding features based on Treebank-based syntactic analyses of the source and target sentences will address grammatical errors in the output of the baseline system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Parse Tree Probability </SectionTitle> <Paragraph position="0"> The most straightforward way to integrate a statistical parser in the system would be to use the (log of the) parser probability as a feature function. Unfortunately, this feature function did not help to obtain better results (it actually seems to significantly hurt performance).</Paragraph> <Paragraph position="1"> To analyze the reason for this, we performed an experiment to test whether the statistical parser we used assigns higher probability to presumably grammatical sentences.</Paragraph> <Paragraph position="2"> The following table shows the average log probability assigned by the Collins parser to the 1-best (produced), oracle, and reference translations:
Hypothesis      1-best   Oracle   Reference
log(parseProb)  -147.2   -148.5   -154.9
We observe that the average parser log-probability of the 1-best translation is higher than the average parse log-probability of the oracle or the reference translations. Hence, it turns out that the parser is actually assigning higher probabilities to the ungrammatical MT output than to the presumably grammatical human translations. One reason for this is that the MT output uses fewer unseen words and typically more frequent words, which lead to a higher language model probability. We also performed experiments to balance this effect by dividing the parser probability by the word unigram probability and using this 'normalized parser probability' as a feature function, but this also did not yield improvements.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Tree-to-String Alignment </SectionTitle> <Paragraph position="0"> A tree-to-string model is one of several syntax-based translation models. The model is a conditional probability p(f|T(e)). Here, we used the model defined by Yamada and Knight (2001) and Yamada and Knight (2002).</Paragraph> <Paragraph position="1"> Internally, the model performs three types of operations on each node of a parse tree. First, it reorders the child nodes, such as changing VP -> VB NP PP into VP -> NP PP VB. Second, it inserts an optional extra word at each node. Third, it translates the leaf English words into Chinese words. These operations are stochastic, and their probabilities are assumed to depend only on the node, independent of other operations on the node or on other nodes. The probability of each operation is automatically obtained by a training algorithm, using about 780,000 English parse tree-Chinese sentence pairs. 
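As a minimal illustration of this independence assumption (a sketch only, with a simplified interface; this is not Yamada and Knight's implementation), the probability of one particular derivation factors into per-node operation probabilities:

```python
import math

def node_logprob(reorder_p, insert_p, leaf_translation_ps):
    # One node's contribution: its reorder choice, its insertion choice, and
    # the translations of any leaf words directly below it.
    return (math.log(reorder_p)
            + math.log(insert_p)
            + sum(math.log(p) for p in leaf_translation_ps))

def derivation_logprob(per_node_choices):
    # Independence across nodes: a derivation's log probability is the sum of
    # the per-node operation log probabilities.
    return sum(node_logprob(r, i, t) for (r, i, t) in per_node_choices)
```

Summing such derivation probabilities over all derivations compatible with a given (parse tree, Chinese string) pair, rather than fixing a single derivation, gives the quantity the model assigns to the pair.
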
The probability of these operations, theta(e^k_{i,j}), is assumed to depend only on the edge of the tree being modified, e^k_{i,j}, and to be independent of everything else, giving the following equation: p(f|T(e)) = sum_{Theta} prod_{k,i,j} P(theta(e^k_{i,j}) | e^k_{i,j}), where Theta varies over the possible alignments between f and e, and theta(e^k_{i,j}) is the particular operation (in Theta) for the edge e^k_{i,j}.</Paragraph> <Paragraph position="6"> The model is further extended to incorporate phrasal translations performed at each node of the input parse tree (Yamada and Knight, 2002). An English phrase covered by a node can be directly translated into a Chinese phrase without the regular reorderings, insertions, and leaf-word translations.</Paragraph> <Paragraph position="7"> The model was trained using about 780,000 English parse tree-Chinese sentence pairs. There are about 3 million words on the English side, and they were parsed by Collins' parser.</Paragraph> <Paragraph position="8"> Since the model is computationally expensive, we added some limitations on the model operations. As the base MT system does not produce a translation with a big word jump, we restrict the model not to reorder child nodes when the node covers more than seven words. For a node that has more than four children, the reordering probability is set to be uniform. We also introduced pruning, which discards partial (subtree-substring) alignments if the probability is lower than a threshold.</Paragraph> <Paragraph position="9"> The model gives the sum of all possible alignment probabilities for a pair of a Chinese sentence and an English parse tree. We also calculate the probability of the best alignment according to the model. Thus, we have the following two feature functions: hTreeToStringSum, the log of the sum of all alignment probabilities, and hTreeToStringViterbi, the log of the probability of the best alignment.</Paragraph> <Paragraph position="11"> As the model is computationally expensive, we sorted the n-best lists by sentence length and processed them from the shortest to the longest. We used 10 CPUs for about five days, and 273/997 development sentences and 237/878 test sentences were processed.</Paragraph> <Paragraph position="12"> The average %BLEU score (average of the best four among 20 different search initial points) was 31.7 for both hTreeToStringSum and hTreeToStringViterbi. Among the processed development sentences, the model preferred the oracle sentences over the produced sentence in 61% of the cases.</Paragraph> <Paragraph position="13"> The biggest problem with this model is that it is computationally very expensive. It processed less than 30% of the n-best lists despite long CPU hours. In addition, we processed short sentences only. For long sentences, it is not practical to use this model as it stands.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Tree-to-Tree Alignment </SectionTitle> <Paragraph position="0"> A tree-to-tree translation model makes use of syntactic trees for both the source and target language. As in the tree-to-string model, a set of operations apply, each with some probability, to transform one tree into another.</Paragraph> <Paragraph position="1"> However, when training the model, trees for both the source and target languages are provided, in our case from the Chinese and English parsers.</Paragraph> <Paragraph position="2"> We began with the tree-to-tree alignment model presented by Gildea (2003). The model was extended to handle dependency trees, and to make use of the word-level alignments produced by the baseline MT system. 
The probability assigned by the tree-to-tree alignment model, given the word-level alignment with which the candidate translation was generated, was used as a feature in our rescoring system.</Paragraph> <Paragraph position="3"> We trained the parameters of the tree transformation operations on 42,000 sentence pairs of parallel Chinese-English data from the Foreign Broadcast Information Service (FBIS) corpus. The lexical translation probabilities Pt were trained using IBM Model 1 on the 30 million word training corpus. This was done to overcome the sparseness of the lexical translation probabilities estimated while training the tree-to-tree model, which was not able to make use of as much training data.</Paragraph> <Paragraph position="4"> As a test of the tree-to-tree model's discrimination, we performed an oracle experiment, comparing the model's scores for the first sentence in the n-best list and for the candidate giving the highest BLEU score. On the 1000-best list for the 993-sentence development set, restricting ourselves to sentences with no more than 60 words and a branching factor of no more than five in either the Chinese or English tree, we achieved results for 480, or 48%, of the 993 sentences. Of these 480, the model preferred the produced translation over the oracle 52% of the time, indicating that it does not in fact seem likely to significantly improve BLEU scores when used for reranking. Using the probability of the source Chinese dependency parse aligning with the n-best hypothesis dependency parse as a feature function, making use of the word-level alignments, yields a 31.6 %BLEU score -- identical to our baseline.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.4 Markov Assumption for Tree Alignments </SectionTitle> <Paragraph position="0"> The tree-based feature functions described so far have the following limitations: full parse-tree models are expensive to compute for long sentences and for trees with flat constituents, and there is limited reordering observed in the n-best lists that form the basis of our experiments. In addition, higher levels of the parse tree are rarely observed to be reordered between source and target parse trees.</Paragraph> <Paragraph position="1"> In this section we attack these problems using a simple Markov model for tree-based alignments. It guarantees tractability: compared to a coverage of approximately 30% of the n-best list by the unconstrained tree-based models, the Markov model approach provides 98% coverage of the n-best list. In addition, this approach is robust to inaccurate parse trees.</Paragraph> <Paragraph position="2"> The algorithm works as follows: we start with word alignments and two parameters, n for the maximum number of words in a tree fragment and k for the maximum height of a tree fragment. We proceed from left to right in the Chinese sentence and incrementally grow a pair of subtrees, one subtree in Chinese and the other in English, such that each word in the Chinese subtree is aligned to a word in the English subtree. We grow this pair of subtrees until we can no longer grow either subtree without violating the two parameter values n and k. Note that these aligned subtree pairs have properties similar to alignment templates. They can be rearranged in complex ways between source and target. Figure 2 shows how subtree-pairs for parameters n = 3 and k = 3 can be drawn for this sentence pair. 
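The growing procedure just described can be sketched roughly as follows (illustrative Python; minimal_subtree, num_words, and height are assumed helper interfaces over the parse trees, not any particular parser's API, and the handling of unaligned words is simplified):

```python
def grow_subtree_pairs(zh_len, alignment, zh_tree, en_tree, n, k):
    """Greedily extract aligned subtree pairs, left to right over the Chinese side.

    alignment maps a Chinese position to the set of English positions it is
    aligned to.  zh_tree and en_tree are assumed to expose
    minimal_subtree(positions), returning the smallest subtree covering the
    given word positions, with num_words() and height() accessors.
    """
    pairs = []
    start = 0
    while start < zh_len:
        zh_words, en_words = set(), set()
        end = start
        while end < zh_len:
            cand_zh = zh_words | {end}
            cand_en = en_words | alignment.get(end, set())
            zh_frag = zh_tree.minimal_subtree(cand_zh)
            en_frag = en_tree.minimal_subtree(cand_en)
            if (zh_frag.num_words() > n or zh_frag.height() > k or
                    en_frag.num_words() > n or en_frag.height() > k):
                break  # growing any further would violate the n/k limits
            zh_words, en_words = cand_zh, cand_en
            end += 1
        if zh_words:
            pairs.append((zh_tree.minimal_subtree(zh_words),
                          en_tree.minimal_subtree(en_words)))
            start = end
        else:
            start = end + 1  # even a single word breaks the limits; skip it
    return pairs
```

Each resulting pair then plays the role of an alignment-template-like unit to which the tree-based models of the previous sections can be applied independently.
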
In our experiments, we use substantially bigger tree fragments, with parameters set to n = 8 and k = 9.</Paragraph> <Paragraph position="3"> Once these subtree-pairs have been obtained, we can easily assert a Markov assumption for the tree-to-tree and tree-to-string translation models that exploits these pairings. Let us consider a sentence pair in which we have discovered n subtree-pairs, which we can call Frag_0, ..., Frag_n. We can then compute a feature function for the sentence pair using the tree-to-string translation model as follows: under the Markov assumption, the model is applied to each subtree-pair separately, and the feature value is the product (the sum of the logs) of the per-fragment scores. Applying this assumption to the tree-to-string model described in Section 6.2, we obtain a coverage improvement to 98% from the original 30%. The accuracy of the tree-to-string model also improved, with a %BLEU score of 32.0, which is the best-performing single syntactic feature.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.5 Using TAG elementary trees for scoring word alignments </SectionTitle> <Paragraph position="0"> In this section, we consider another method for carving up the full parse tree. However, in this method, instead of subtree-pairs we consider a decomposition of parse trees that provides each word with a fragment of the original parse tree, as shown in Figure 3. The formalism of Tree-Adjoining Grammar (TAG) provides the definition of what each tree fragment should be and, in addition, of how to decompose the original parse trees to provide the fragments. Each fragment is a TAG elementary tree, and the composition of these TAG elementary trees in a TAG derivation tree provides the decomposition of the parse trees. The decomposition into TAG elementary trees is done by augmenting the parse trees for the source and target sentences with head-word and argument (or complement) information, using heuristics that are common to most contemporary statistical parsers and easily available for both English and Chinese. Note that we do not use the word alignment information for the decomposition into TAG elementary trees.</Paragraph> <Paragraph position="1"> Once we have a TAG elementary tree per word, we can create several models that score word alignments by exploiting the alignments between source and target TAG elementary trees. Let t_{f_i} and t_{e_i} be the TAG elementary trees associated with the aligned words f_i and e_i, respectively. We experimented with two models over alignments: a unigram model, prod_i P(f_i, t_{f_i}, e_i, t_{e_i}), and a conditional model, prod_i P(e_i, t_{e_i} | f_i, t_{f_i}) x P(f_{i+1}, t_{f_{i+1}} | f_i, t_{f_i}). We trained both of these models using the SRI Language Modeling Toolkit on 60K aligned parse trees.</Paragraph> <Paragraph position="2"> We extracted 1300 TAG elementary trees each for Chinese and for English. The unigram model gets a %BLEU score of 31.7, and the conditional model gets a %BLEU score of 31.9.</Paragraph> <Paragraph position="3"> [Table caption fragment: results for each feature added to the baseline features on its own, and for a combination of new features.]</Paragraph> </Section> </Section> </Paper>