<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0115"> <Title>Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons</Title> <Section position="5" start_page="185" end_page="185" type="metho"> <SectionTitle> 2 EXPERIMENTAL FRAMEWORK </SectionTitle> <Paragraph position="0"> All translation lexicons discussed in this paper were created and evaluated using the procedure in Figure 1. First, candidate translations were generated for each pair of aligned training sentences, by taking a simple cross-product of the words. Next, the candidate translations from each pair of training sentences were passed through a cascade of filters. The remaining candidate translations from all training sentence pairs were pooled together and fed into a fixed decision procedure. The output of the decision procedure was a model of word correspondences between the two halves of the training corpus -- a translation lexicon. Each filter combination resulted in a different model.</Paragraph> <Paragraph position="1"> All the models were compared in terms of how well they represented a held-out test set. The evaluation was performed objectively and automatically using Bitext-Based Lexicon Evaluation (BIBLE, described below). BIBLE assigned a score for each model, and these scores were used to compare the effectiveness of various filter cascades.</Paragraph> <Paragraph position="2"> As shown in Figure 1, the only independent variable in the framework is the cascade of filters used on the translation candidates generated by each sentence pair, while the only dependent variable is a numerical score. Since the filters only serve to remove certain translation candidates, any number of filters can be used in sequence. 
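The generate-then-filter step described above can be illustrated with a minimal sketch. It assumes that a filter is simply a function from a list of candidate pairs to a sub-list; the names candidate_pairs, apply_cascade and same_initial_filter are hypothetical and are not taken from the paper, and the toy predicate is only a stand-in for the real filters of Section 4.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]
# Assumption: a filter maps one sentence pair's candidates to a subset of them.
Filter = Callable[[List[Pair]], List[Pair]]


def candidate_pairs(source: List[str], target: List[str]) -> List[Pair]:
    """Simple cross-product of the words in one aligned sentence pair."""
    return list(product(source, target))


def apply_cascade(candidates: List[Pair], cascade: Iterable[Filter]) -> List[Pair]:
    """Pass the candidates through each filter in turn; filters only remove pairs."""
    for current_filter in cascade:
        candidates = current_filter(candidates)
    return candidates


def same_initial_filter(candidates: List[Pair]) -> List[Pair]:
    """Hypothetical toy predicate filter: keep pairs sharing a first letter."""
    return [(s, t) for s, t in candidates if s[0] == t[0]]


pairs = candidate_pairs(["le", "chat"], ["the", "cat"])
print(apply_cascade(pairs, [same_initial_filter]))
```

Because every filter has the same input and output type, an empty cascade returns the baseline candidates unchanged and cascades of any length compose in the same way.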
This arrangement allows for fair comparison of different filter combinations.</Paragraph> </Section> <Section position="6" start_page="185" end_page="188" type="metho"> <SectionTitle> 3 BITEXT-BASED LEXICON EVALUATION (BIBLE) </SectionTitle> <Paragraph position="0"> Translation lexicon quality has traditionally been measured on two axes: precision and recall. Recall is the fraction of the source language's vocabulary that appears in the lexicon. Precision is the fraction of lexicon entries that are correct. While the true size of the source vocabulary is usually unknown, recall can be estimated using a representative text sample by computing the fraction of words in the text that also appear in the lexicon. Measuring precision is much more difficult, because it is unclear what a &quot;correct&quot; lexicon entry is -- different translations are appropriate for different contexts, and, in most cases, more than one translation is correct. This is why evaluation of translation has eluded automation efforts until now.</Paragraph> <Paragraph position="1"> The large number of quantitative lexicon evaluations required for the present study made it infeasible to rely on evaluation by human judges. The only existing automatic lexicon evaluation method that I am aware of is the perplexity comparisons used by Brown et al. in the framework of their Model 1 \[Bro93\]. Lexicon perplexity indicates how &quot;sure&quot; a translation lexicon is about its contents. It does not, however, directly measure the quality of those contents.</Paragraph> <Paragraph position="2"> BIBLE is a family of algorithms, based on the observation that translation pairs tend to appear in corresponding sentences in an aligned bilingual text corpus (a bitext). Given a test set of aligned sentences, a better translation lexicon will contain a higher fraction of the (source word, target word) pairs in those sentences. 
This fraction can be computed either by token or by type, depending on the application. If only the words in the lexicon are considered, BIBLE gives an estimate of precision. If all the words in the text are considered, then BIBLE measures percent correct. The greater the overlap between the vocabulary of the test bitext and the vocabulary of the lexicon being evaluated, the more confidence can be placed in the BIBLE score.</Paragraph> <Paragraph position="3"> The BIBLE approach is suitable for many different evaluation tasks. Besides comparing different lexicons on different scales, BIBLE can be used to compare different parts of one lexicon that has been partitioned using some characteristic of its entries. For example, the quality of a lexicon's noun entries can be compared to the quality of its adjective entries; the quality of its entries for frequent words can be compared to the quality of its entries for rare words. Likewise, separate evaluations can be performed for each k, 1 ≤ k ≤ N, in N-best lexicons.</Paragraph> <Paragraph position="4"> Figure 2 shows the outline of a BIBLE algorithm for evaluating precision of N-best translation lexicons. The kth cumulative hit rate for a source word S is the fraction of test sentences containing S whose translations contain one of the k best translations of S in the lexicon. [Figure 2: outline of the BIBLE algorithm; the body of the figure did not survive extraction. Input: (1) a translation lexicon with up to N translations for each word; (2) an aligned test bitext. Percent correct can be evaluated instead of precision by switching lines 3 and 4.]</Paragraph> <Paragraph position="5"> For each k, the kth cumulative hit rates are averaged over all the source words in the lexicon, counting words by type. 
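The hit-rate computation described above can be sketched as follows. This is an illustration rather than the algorithm of Figure 2, whose body is not reproduced in this text. It assumes the lexicon is a mapping from a source word to its ranked translations and the test bitext is a list of (source words, target words) pairs, and it assumes that lexicon words absent from the test bitext are skipped. The function name average_cumulative_hit_rates is hypothetical.

```python
from typing import Dict, List, Tuple


def average_cumulative_hit_rates(
    lexicon: Dict[str, List[str]],
    bitext: List[Tuple[List[str], List[str]]],
    n: int,
) -> List[float]:
    """Return [rate_1, ..., rate_n], averaged by type over lexicon words."""
    totals = [0.0] * n
    words_seen = 0
    for word, translations in lexicon.items():
        # Target sides of the test sentence pairs whose source side contains the word.
        targets = [set(tgt) for src, tgt in bitext if word in src]
        if not targets:
            continue  # assumption: words unseen in the test bitext are skipped
        words_seen += 1
        for k in range(1, n + 1):
            best_k = set(translations[:k])
            hits = sum(1 for tgt in targets if not best_k.isdisjoint(tgt))
            totals[k - 1] += hits / len(targets)
    if words_seen == 0:
        return [0.0] * n
    # Unweighted average: each source word counts once, however frequent it is.
    return [total / words_seen for total in totals]
```

Each source word contributes one per-word rate for every k from 1 to N, and the final step takes the unweighted mean of those rates for each k.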
This yields N average cumulative hit rates for the lexicon as a whole.</Paragraph> <Paragraph position="6"> In this study, the average is computed by type and not by token, because translations for the most frequent words are easy to estimate using any reasonable statistical decision procedure, even without any extra information. Token-based evaluation scores would be misleadingly inflated with very little variation. Computing hit rates for each word separately and then taking an unweighted average ensures that a correct translation of a common source word does not contribute more to the score than correct translations of rare words. The evaluation is uniform over the whole lexicon. BIBLE evaluation is quite harsh, because many translations are not word for word in real bitexts. To put BIBLE scores reported here into proper perspective, human performance was evaluated on a similar task. The 1994 ARPA-sponsored machine translation evaluation effort generated two independent English translations of one hundred French newspaper texts \[Whi93\]. I hand-aligned each pair of translations by paragraph; most paragraphs contained between one and four sentences. For each pair of translations, the fraction of times (by type) that identical words were used in corresponding paragraphs was computed. The average of these 100 fractions was 0.6182 with a standard deviation of 0.0647. This is a liberal estimate of the upper bound on the internal consistency of BIBLE test sets. Scores for sentence-based comparisons will always be lower than scores for paragraph-based comparisons, because there will be fewer spurious &quot;hits.&quot; To confirm this, an independent second translation of 50 French Hansard sentences was commissioned. 
The translation scored 0.57 on this test.</Paragraph> </Section> <Section position="7" start_page="188" end_page="192" type="metho"> <SectionTitle> 4 EXPERIMENTS </SectionTitle> <Paragraph position="0"> A bilingual text corpus of Canadian parliamentary proceedings (&quot;Hansards&quot;) was aligned by sentence using the method presented in \[Gal91b\]. From the resulting aligned corpus, this study used only sentence pairs that were aligned one to one, and then only when they were less than 16 words long and aligned with high confidence. Morphological variants in these sentences were stemmed to a canonical form. Fifteen thousand sentence pairs were randomly selected and reserved for testing; one hundred thousand were used for training.</Paragraph> <Paragraph position="1"> The independent variable in the experiments was a varying combination of four different filters, used with six different sizes of training corpora. These four filters fall into three categories: predicate filters, oracle filters and alignment filters. A predicate filter is one where the candidate translation pair (S, T) must satisfy some predicate in order to pass the filter. Various predicate filters are discussed in \[Wu94\]. An oracle filter is useful when a list of likely translation pairs is available a priori. Then, if the translation pair (S, T) occurs in this oracle list, it is reasonable to filter out all other translation pairs involving S or T in the same sentence pair. An alignment filter is based on the relative positions of S and T in their respective texts \[Dag93\].</Paragraph> <Paragraph position="2"> The decision procedure used to select lexicon entries from the multiset of candidate translation pairs is a variation of the method presented in \[Gal91a\]. \[Dun93\] found binomial log-likelihood ratios to be relatively accurate when dealing with rare tokens. This statistic was used to estimate dependencies between all co-occurring (source word, target word) pairs. 
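As an illustration of the statistic, the following sketch computes the standard log-likelihood ratio (often written G squared) for a 2x2 contingency table. The way the four counts are obtained from the pooled candidate pairs is an assumption here: k11 counts co-occurrences of S and T, k12 counts S without T, k21 counts T without S, and k22 counts everything else. The function name is hypothetical and the sketch is not claimed to match the exact variant used in this study.

```python
import math


def log_likelihood_ratio(k11: int, k12: int, k21: int, k22: int) -> float:
    """G-squared statistic for a 2x2 table: 2 * sum(observed * ln(observed / expected))."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, r, c in cells:
        if observed > 0:  # empty cells contribute nothing to the sum
            expected = rows[r] * cols[c] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# A perfectly associated pair scores higher than a weakly associated one.
print(log_likelihood_ratio(10, 0, 0, 10), log_likelihood_ratio(6, 4, 4, 6))
```

The statistic is zero when the two words occur independently and grows with the strength of the dependence, which is what makes it usable as a ranking score.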
For each source word S, target words were ranked by their dependence with S. The top N target words in the rank-ordering for S formed the entry for S in the N-best lexicon. In other words, the relative magnitude of dependence between S and its candidate translations was used as a maximum likelihood estimator of the translations of S.</Paragraph> <Section position="1" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 4.1 Part of Speech Filter </SectionTitle> <Paragraph position="0"> The POS Filter is a predicate filter. It is based on the idea that word pairs that are good translations of each other are likely to be the same parts of speech in their respective languages. For example, a noun in one language is very unlikely to be translated as a verb in another language. Therefore, candidate translation pairs involving different parts of speech should be filtered out.</Paragraph> <Paragraph position="1"> This heuristic should not be taken too far, however, in light of the imperfection of today's tagging technology. For instance, particles are often confused with prepositions and adjectives with past participles. These considerations are further complicated by the differences in the tag sets used by taggers for different languages. To maximize the filter's effectiveness, tag sets must be remapped to a more general common tag set, which ignores many of the language-specific details.</Paragraph> <Paragraph position="2"> Otherwise, correct translation pairs would be filtered out because of superficial differences like tense and capitalization.</Paragraph> <Paragraph position="3"> The different ways to remap different tag sets into a more general common tag set represent a number of design decisions. Fortunately, BIBLE provided an objective criterion for tag set design, and a fast evaluation method. The English half of the corpus was tagged using Brill's transformation-based tagger \[Bri92\]. 
The French half was kindly tagged by George Foster of CITI.</Paragraph> <Paragraph position="4"> Then, BIBLE was used to select among several possible generalizations of the two tag sets. The resulting optimal tag set is shown in Table 2.</Paragraph> </Section> <Section position="2" start_page="189" end_page="190" type="sub_section"> <SectionTitle> 4.2 Machine-Readable Bilingual Dictionary (MRBD) </SectionTitle> <Paragraph position="0"> An oracle list of 53363 one-to-one translation pairs was extracted from the Collins French-English MRBD \[Cou91\]. Whenever a candidate translation pair (S,T) appeared in the list of translations extracted from the MRBD, the filter removed all word pairs (S, not T) and (not S, T) that occurred in the same sentence pair.</Paragraph> <Paragraph position="1"> The MRBD Filter is an oracle filter. It is based on the assumption that if a candidate translation pair (S,T) appears in an oracle list of likely translations, then T is the correct translation of S in their sentence pair, and there are no other translations of S or T in that sentence pair. This assumption is stronger than the one made by Brown et al. \[Bro94\], where the MRBD was treated as data and not as an oracle. Brown et al. allowed the training data to override information gleaned from the MRBD. The attitude of the present study is &quot;Don't guess when you know.&quot; This attitude may be less appropriate when there is less of an overlap between the vocabulary of the MRBD and the vocabulary of the training bitext, as when dealing with technical text or with a very small MRBD.</Paragraph> <Paragraph position="2"> The presented framework can be used as a method of enhancing an MRBD. Merging an MRBD with an N-best translation lexicon induced using the MRBD Filter will result in an MRBD with more entries that are relevant to the sublanguage of the training bitext. 
All the relevant entries will be rank ordered for appropriateness.</Paragraph> </Section> <Section position="3" start_page="190" end_page="190" type="sub_section"> <SectionTitle> 4.3 Cognate Filter </SectionTitle> <Paragraph position="0"> A Cognate Filter is another kind of oracle filter. It is based on the simple heuristic that if a source word S is a cognate of some target word T, then T is the correct translation of S in their sentence pair, and there are no other translations of S or T in that sentence pair. Of course, identical words can mean different things in different languages. The cognate heuristic fails when dealing with such faux amis \[Mac94\]. Fortunately, between French and English, true cognates occur far more frequently than faux amis.</Paragraph> <Paragraph position="1"> There are many possible notions of what a cognate is. Simard et al. used the criterion that the first four characters must be identical for alphabetic tokens to be considered cognates \[Sim92\]. Unfortunately, this criterion produces false negatives for pairs like &quot;government&quot; and &quot;gouvernement&quot;, and false positives for words with a great difference in length, like &quot;conseil&quot; and &quot;conservative.&quot; I used an approximate string matching algorithm to capture a more general notion of cognateness.</Paragraph> <Paragraph position="2"> Whether a pair of words is considered a cognate pair depends on the ratio of the length of their longest (not necessarily contiguous) common subsequence to the length of the longer word. This is called the Longest Common Subsequence Ratio (LCSR). For example, &quot;gouvernement,&quot; which is 12 letters long, has 10 letters that appear in the same order in &quot;government.&quot; So, the LCSR for these two words is 10/12. On the other hand, the LCSR for &quot;conseil&quot; and &quot;conservative&quot; is only 6/12. The only remaining question was what minimum LCSR value should indicate that two words are cognates. 
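The LCSR definition above can be sketched with a standard dynamic-programming computation of the longest common subsequence. The function names lcs_length, lcsr and is_cognate are hypothetical, and the implementation is an illustration rather than the code used in this study.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: LCS length over the longer word's length."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


def is_cognate(a: str, b: str, cutoff: float) -> bool:
    """Treat two words as cognates when their LCSR reaches the given cut-off."""
    return lcsr(a, b) >= cutoff


print(lcsr("gouvernement", "government"), lcsr("conseil", "conservative"))
```

On the two examples above the sketch gives 10/12 and 6/12, matching the values in the text. The cut-off is deliberately left as a parameter, since the minimum LCSR that should count as cognateness is the open question at this point.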
This question was easy to answer using BIBLE. BIBLE scores were maximized for lexicons using the Cognate Filter when an LCSR cut-off of 0.58 was used. The Wilcoxon signed ranks test found the difference between BIBLE scores for lexicons produced with this LCSR cut-off and for lexicons produced with the criterion used in \[Sim92\] to be statistically significant at α = 0.01.</Paragraph> <Paragraph position="3"> The longest common subsequence between two words can be computed as a special case of their edit distance, in time proportional to the product of their lengths \[Wag74\].</Paragraph> </Section> <Section position="4" start_page="190" end_page="192" type="sub_section"> <SectionTitle> 4.4 Word Alignment Filter </SectionTitle> <Paragraph position="0"> Languages with a similar syntax tend to express ideas in similar order. The translation of a word occurring at the end of a French sentence is likely to occur towards the end of the English translation. In general, lines drawn between corresponding lexemes in a French sentence and its English translation will be mostly parallel.</Paragraph> <Paragraph position="1"> [Figure: example sentence pair.] Les néo-démocrates ont aussi parlé de General Motors dans ce contexte. / The NDP Members also mentioned General Motors in this context.</Paragraph> <Paragraph position="2"> [Figure 4 caption, beginning lost in the source:] minimized by aligning D with g rather than with c.</Paragraph> <Paragraph position="3"> This idea of translation alignment was central to the machine translation method pioneered at IBM \[Bro93\].</Paragraph> <Paragraph position="4"> The Word Alignment Filter exploits this observation, as illustrated in Figure 3. If word T in a target sentence is the translation of word S in the corresponding source sentence, then words occurring before S in the source sentence will likely correspond to words occurring before T in the target sentence. Likewise, words occurring after S in the source sentence will likely translate to words occurring after T in the target sentence. 
So S and T can be used as loci for partitioning the source and target sentences into two shorter pairs of corresponding word strings. Each such partition reduces the number of candidate translations from each sentence pair by approximately a factor of two -- an excellent noise filter for the decision procedure.</Paragraph> <Paragraph position="5"> The Word Alignment Filter is particularly useful when oracle lists are available to identify a large number of translation pairs that can be used to partition sentences. Using an LCSR cut-off of 0.58 (optimized using BIBLE, of course), cognates were found for 23% of the source tokens in the training corpus (counting punctuation). 47% of the source tokens were found in the MRBD. Although there was some overlap, an average of 63% of the words in each sentence were paired up with a cognate or with a translation found in the MRBD, leaving few candidate translations for the remaining 37%.</Paragraph> <Paragraph position="6"> The oracle lists often supplied more than one match per word. For instance, several determiners or prepositions in the French sentence often matched the same word in the English sentence. When this happened, the current implementation of the Word Alignment Filter used several heuristics to choose at most one partitioning locus per word. For example, one heuristic says that the order of ideas in a sentence is not likely to change during translation. So, it aimed to minimize crossing partitions, as shown in Figure 4. If word A matches word e, and word D matches words c and g, then D is paired with g, so that when the sentences are written one above the other, the lines connecting the matching words do not cross. 
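The partitioning idea can be sketched as follows. This is only an illustration: the heuristics that choose the partitioning loci are not reproduced, and the sketch assumes the loci are already given as index pairs (i, j), sorted and non-crossing. The function name partition_candidates is hypothetical.

```python
from typing import List, Tuple


def partition_candidates(
    source: List[str],
    target: List[str],
    anchors: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Pair words only within corresponding segments between consecutive loci.

    Assumption: anchors are (source index, target index) pairs, sorted and
    non-crossing; the anchored words themselves are treated as already matched.
    """
    candidates = []
    prev_i, prev_j = -1, -1
    # A sentinel locus past the end of both sentences closes the last segment.
    for i, j in list(anchors) + [(len(source), len(target))]:
        for s in source[prev_i + 1:i]:
            for t in target[prev_j + 1:j]:
                candidates.append((s, t))
        prev_i, prev_j = i, j
    return candidates


src, tgt = ["a", "b", "c"], ["x", "y", "z"]
print(len(partition_candidates(src, tgt, [])))
print(partition_candidates(src, tgt, [(1, 1)]))
```

With no loci the function returns the full cross-product, and each added locus restricts candidates to the word strings on the same side of it in both sentences. The sketch relies on the loci already respecting the non-crossing heuristic described above.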
Between French and English, this heuristic works quite well, except when it comes to the order between nouns and adjectives.</Paragraph> </Section> <Section position="5" start_page="192" end_page="192" type="sub_section"> <SectionTitle> 4.5 Evaluation </SectionTitle> <Paragraph position="0"> Table 1 is unusual: It is atypical for more than two of the filters studied here to incrementally improve one lexicon entry. Most lexicon entries are improved by just one or two filters, after which more filtering gives no significant benefit. However, each filter improves a large number of different entries. Two more examples of the benefits of different filter cascades are given in Tables 3 and 4.</Paragraph> <Paragraph position="1"> Only the most likely translation and the fourth most likely translation in the baseline lexicon are appropriate. The Cognate Filter allows the fourth item, a cognate, to percolate up to second place, and makes room for &quot;two-party&quot; in sixth place.</Paragraph> <Paragraph position="2"> Entry # No Filters Cognate Filter</Paragraph> </Section> </Section> <Section position="8" start_page="192" end_page="195" type="metho"> <SectionTitle> 1 Party 2 Liberal 3 Democratic 4 party 5 Conservative 6 new 7 the </SectionTitle> <Paragraph position="0"> Figures 5 and 6 show mean BIBLE scores for precision of the best translations in lexicons induced with various cascades of the four filters discussed.</Paragraph>
<Paragraph position="1"> [Figures 5 and 6: plots of mean BIBLE precision scores; the plot area did not survive extraction. Visible axis values lie between 0.44 and 0.57. Legend entries: POS, Cognate and MRBD Filters with Word Alignment; POS, Cognate and MRBD Filters; POS and Cognate Filters; POS Filter; Cognate Filter; baseline (no filters).]</Paragraph> <Paragraph position="2"> Cognate Filter by itself achieves the best precision for the best-of-N translations, when N > 2.</Paragraph> <Paragraph position="3"> The POS Filter only degrades precision for large training corpora.</Paragraph>
<Paragraph position="4"> Assuming that BIBLE scores are normally distributed, 95% confidence intervals were estimated for each score, using ten mutually exclusive training sets of each size. All the confidence intervals were narrower than one percentage point at 500 pairs of training sentences, and narrower than half of one percentage point at 2000 pairs.</Paragraph> <Paragraph position="5"> Therefore, BIBLE score differences displayed in Figures 5 and 6 are quite reliable.</Paragraph> <Paragraph position="6"> The upper bound on performance for this task is plotted at 0.57 (see end of Section 3). The better filter cascades produce lexicons whose precision comes close to this mark. The best cascades are up to 137% more precise than the baseline model. The large MRBD resulted in the most useful filter for this pair of languages. Future research will look into why the MRBD's contribution to lexicon precision decreases with more training data.</Paragraph> <Paragraph position="7"> hundred thousand sentences is used. All the presented filters, except the POS Filter, improve performance even when a large training corpus is available. Evidently, some information that is useful for inducing translation lexicons cannot be inferred from any amount of training data using only simple statistical methods. The best precision for the single best translation is achieved by a cascade of the MRBD, Cognate and Word Alignment Filters. 
To maximize precision for the best of three or more translations, only the Cognate Filter should be used.</Paragraph> </Section> <Section position="9" start_page="195" end_page="196" type="metho"> <SectionTitle> 5 APPLICATION TO MACHINE-ASSISTED TRANSLATION </SectionTitle> <Paragraph position="0"> A machine translation system should not only translate with high precision, but it should also have good coverage of the source language. So, the product of recall and precision, percent correct, is a good indication of a lexicon's suitability for use with such a system. This statistic actually represents the percentage of words in the target test corpus that would be correctly translated from the source, if the lexicon were used as a simple map. Therefore, if the lexicon is to be used as part of a machine-assisted translation system, then the percent correct score will be inversely proportional to the required post-editing time.</Paragraph> <Paragraph position="1"> A simple strategy was adopted to demonstrate the practical utility of filters presented in this paper. First, the most precise filter cascade was selected by looking at Figure 5. Translations were found for all words in the test source text that had entries in the lexicon induced using that cascade. Then the second most precise filter cascade was selected. Words that the most precise lexicon &quot;didn't know about,&quot; which were found in the second most precise lexicon, were translated next. All the other available lexicons were cascaded this way, in the order of their apparent precision, down to the baseline lexicon. 
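The lookup order just described can be sketched as follows, assuming each lexicon is reduced to a mapping from a source word to its single best translation and that the list of lexicons is ordered from most precise to baseline. The names backoff_translate and translate_text are hypothetical.

```python
from typing import Dict, List, Optional


def backoff_translate(word: str, lexicons: List[Dict[str, str]]) -> Optional[str]:
    """Return the translation from the first (most precise) lexicon that knows the word."""
    for lexicon in lexicons:
        if word in lexicon:
            return lexicon[word]
    return None  # no lexicon, not even the baseline, has an entry


def translate_text(words: List[str], lexicons: List[Dict[str, str]]) -> List[Optional[str]]:
    """Translate word by word, using the lexicons as simple maps."""
    return [backoff_translate(word, lexicons) for word in words]


precise = {"chat": "cat"}
baseline = {"chat": "chat", "chien": "dog"}
print(translate_text(["chat", "chien", "oiseau"], [precise, baseline]))
```

A word is left untranslated only when the baseline lexicon has no entry for it either, so coverage is never lower than that of the baseline.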
This &quot;cascaded back-off&quot; strategy maintained the recall of the baseline lexicon, while taking advantage of the higher precision produced by various filter cascades.</Paragraph> <Paragraph position="2"> Although more sophisticated translation strategies are certainly possible, BIBLE percent correct scores for cascaded lexicons suffice to test the utility of data filters for machine translation. The results in Figure 8 indicate that the filters described in this paper can be used to improve the performance of lexical transfer models by more than 35%.</Paragraph> </Section> </Paper>