<?xml version="1.0" standalone="yes"?> <Paper uid="J00-2004"> <Title>Models of Translational Equivalence among Words</Title> <Section position="9" start_page="237" end_page="90000" type="evalu"> <SectionTitle> 6. Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="237" end_page="29614" type="sub_section"> <SectionTitle> 6.1 Evaluation at the Token Level </SectionTitle> <Paragraph position="0"> This section compares translation model estimation methods A, B, and C to each other and to Brown et al.'s (1993b) Model 1. To reiterate, Model 1 is based on co-occurrence information only; Method A is based on the one-to-one assumption; Method B adds the &quot;one sense per collocation&quot; hypothesis to Method A; Method C conditions the auxiliary parameters of Method B on various word classes. Whereas Methods A and B and Model 1 were fully specified in Section 4.3.1 and Section 5, the latter section described a variety of features on which Method C might classify links. For the purposes of the experiments described in this article, Method C employed the simple classification in Table 4 for both languages in the bitext. All classification was performed by table lookup; no context-aware part-of-speech tagger was used. In particular, words that were ambiguous between open classes and closed classes were always deemed to be in the closed class. The only language-specific knowledge involved in this classification Symbols, such as ~ and * the NULL word, in a class by itself Content words: nouns, adjectives, adverbs, non-auxiliary verbs all other words, i.e., function words method is the list of function words in class F. Certainly, more sophisticated word classification methods could produce better models, but even the simple classification in Table 4 should suffice to demonstrate the method's potential.</Paragraph> <Paragraph position="1"> 6.1.1 Experiment 1. Until now, translation models have been evaluated either subjectively (e.g. White and O'Connell 1993) or using relative metrics, such as perplexity with respect to other models (Brown et al. 1993b). Objective and more accurate tests can be carried out using a &quot;gold standard.&quot; I hired bilingual annotators to link roughly 16,000 corresponding words between on-line versions of the Bible in French and English. This bitext was selected to facilitate widespread use and standardization (see Melamed \[1998c\] for details). The entire Bible bitext comprised 29,614 verse pairs, of which 250 verse pairs were hand-linked using a specially developed annotation tool. The annotation style guide (Melamed 1998b) was based on the intuitions of the annotators, so it was not biased towards any particular translation model. The annotation was replicated five times by seven different annotators.</Paragraph> <Paragraph position="2"> Each of the four methods was used to estimate a word-to-word translation model from the 29,614 verse pairs in the Bible bitext. All methods were deemed to have converged when less than .0001 of the translational probability distribution changed from one iteration to the next. The links assigned by each of methods A, B, and C in the last iteration were normalized into joint probability distributions using Equation 19. I shall refer to these joint distributions as Model A, Model B, and Model C, respectively. Each of the joint probability distributions was further normalized into two conditional probability distributions, one in each direction. 
Since Model 1 is inherently directional, its conditional probability distributions were estimated separately in each direction, instead of being derived from a joint distribution.</Paragraph>
<Paragraph position="3"> The four models' predictions were compared to the gold standard annotations.</Paragraph>
<Paragraph position="4"> Each model guessed one translation (either stochastically or deterministically, depending on the task) for each word on one side of the gold standard bitext. Therefore, precision = recall here, and I shall refer to the results simply as &quot;percent correct.&quot; The accuracy of each model was averaged over the two directions of translation: English to French and French to English. The five-fold replication of annotations in the test data enabled computation of the statistical significance of the differences in model accuracy. The statistical significance of all results in this section was measured at the α = .05 level, using the Wilcoxon signed ranks test. Although the models were evaluated on part of the same bitext on which they were trained, the evaluations were with respect to the translational equivalence relation hidden in this bitext, not with respect to any of the bitext's visible features. Such testing on training data is standard practice for unsupervised learning algorithms, where the objective is to compare several methods.</Paragraph>
<Paragraph position="5"> Of course, performance would degrade on previously unseen data.</Paragraph>
<Paragraph position="6"> In addition to the different translation models, there were two other independent variables in the experiment: method of translation and whether function words were included. Some applications, such as query translation for CLIR, don't care about function words. To get a sense of the relative effectiveness of the different translation model estimation methods when function words are taken out of the equation, I removed from the gold standard all link tokens where one or both of the linked words were closed-class words. Then, I removed all closed-class words (including nonalphabetic symbols) from the models and renormalized the conditional probabilities.</Paragraph>
<Paragraph position="7"> The method of translation was either single-best or whole distribution. Single-best translation is the kind that somebody might use to get the gist of a foreign-language document. The input to the task was one side of the gold standard bitext. The output was the model's single best guess about the translation of each word in the input, together with the input word. In other words, each model produced link tokens consisting of input words and their translations. For some applications, it is insufficient to guess only the single most likely translation of each word in the input. The model is expected to output the whole distribution of possible translations for each input word. This distribution is then combined with other distributions that are relevant to the application. For example, for cross-language information retrieval, the translational distribution can be combined with the distribution of term frequencies.</Paragraph>
<Paragraph position="8"> For statistical machine translation, the translational distribution can be decoded with a source language model (Brown et al. 1988; Al-Onaizan et al. 1999).
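As one illustration of such a combination (a generic sketch, not the specific CLIR model the author has in mind), a conditional translation distribution can be combined with target-language term frequencies to estimate how strongly a source-language query term is represented in a target-language document; the names and numbers below are hypothetical.

def expected_term_frequency(query_term, trans_dist, target_tf):
    # trans_dist: {source word: {target word: probability}}
    # target_tf:  {target word: frequency in one target-language document}
    # Expected frequency = sum over translations f of P(f | query_term) * tf(f).
    return sum(p * target_tf.get(f, 0)
               for f, p in trans_dist.get(query_term, {}).items())

# Hypothetical toy data, for illustration only:
trans_dist = {"house": {"maison": 0.9, "chambre": 0.1}}
target_tf = {"maison": 3, "chambre": 1}
print(expected_term_frequency("house", trans_dist, target_tf))  # 2.8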
To predict how the different models might perform in such applications, the whole distribution task was to generate a whole set of links from each input word, weighted according to the probability assigned by the model to each of the input word's translations. Each model was tested on this task with and without function words.</Paragraph>
<Paragraph position="9"> The mean results are plotted in Figures 4 and 5 with 95% confidence intervals.</Paragraph>
[Figure 4: Comparison of model performance on single-best translation task. (a) All links; (b) open-class links only.]
[Figure 5: Comparison of model performance on whole distribution task. (a) All links; (b) open-class links only.]
<Paragraph position="10"> All four graphs in these figures are on the same scale to facilitate comparison. On both tasks involving the entire vocabulary, each of the biases presented in this article improves the efficiency of modeling the available training data. When closed-class words were ignored, Model 1 performed better than Method A, because open-class words are more likely to violate the one-to-one assumption. However, the explicit noise model in Methods B and C boosted their scores significantly higher than Model 1 and Method A. Method B was better than Method C at choosing the single best open-class links, and the situation was reversed for the whole distribution of open-class links.</Paragraph>
<Paragraph position="11"> However, the differences in performance between these two methods were tiny on the open-class tasks, because they left only two classes for Method C to distinguish: content words and NULLs. Most of the scores on the whole distribution task were lower than their counterparts on the single-best translation task, because it is more difficult for any statistical method to correctly model the less common translations. The &quot;best&quot; translations are usually the most common.</Paragraph>
<Paragraph position="12"> 6.1.2 Experiment 2. To study how the benefits of the various biases vary with training corpus size, I evaluated Models A, B, C, and 1 on the whole distribution translation task, after training them on three different-size subsets of the Bible bitext. The first subset consisted of only the 250 verse pairs in the gold standard. The second subset included these 250 plus another random sample of 2,250 for a total of 2,500, an order of magnitude larger than the first subset. The third subset contained all 29,614 verse pairs in the Bible bitext, roughly an order of magnitude larger than the second subset.</Paragraph>
<Paragraph position="13"> All models were compared to the five gold standard annotations, and the scores were averaged over the two directions of translation, as before. Again, because the total probability assigned to all translations for each source word was one, precision = recall = percent correct on this task. The mean scores over the five gold standard annotations are graphed in Figure 6, where the right edge of the figure corresponds to the means of Figure 5(a). The figure supports the hypothesis in Melamed (to appear, Chapter 7) that the biases presented in this article are even more valuable when the training data are more sparse. The one-to-one assumption is useful, even though it forces us to use a greedy approximation to maximum likelihood. In relative terms, the advantage of the one-to-one assumption is much more pronounced on smaller training sets.
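The relative comparisons below can be read as ratios: a gain of 102% means the model makes just over twice as many correct guesses as the baseline. A trivial helper, with hypothetical scores for illustration:

def relative_gain(accuracy, baseline_accuracy):
    # Relative improvement in percent: 102 means 2.02 times the baseline accuracy.
    return 100.0 * (accuracy - baseline_accuracy) / baseline_accuracy

# Hypothetical scores, for illustration only (not the paper's actual numbers):
# a model scoring 40.4% against a baseline of 20.0% shows a 102% relative gain.
print(round(relative_gain(40.4, 20.0), 1))  # 102.0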
For example, Model A is 102% more accurate than Model 1 when trained on only 250 verse pairs. The explicit noise model buys a considerable gain in accuracy across all sizes of training data, as do the link classes of Model C. In concert, when trained and tested only on the gold standard test set, the three biases outperformed Model 1 by up to 125%. This difference is even more significant given the absolute performance ceiling of 82% established by the interannotator agreement rates on the gold standard.</Paragraph> </Section>
<Section position="2" start_page="29614" end_page="90000" type="sub_section"> <SectionTitle> 6.2 Evaluation at the Type Level </SectionTitle>
<Paragraph position="0"> An important application of statistical translation models is to help lexicographers compile bilingual dictionaries. Dictionaries are written to answer the question, &quot;What are the possible translations of X?&quot; This is a question about link types, rather than about link tokens.</Paragraph>
<Paragraph position="1"> Evaluation by link type is a thorny issue. Human judges often disagree about the degree to which context should play a role in judgments of translational equivalence. For example, the Harper-Collins French Dictionary (Cousin et al. 1990) gives the following French translations for English appoint: nommer, engager, fixer, désigner. Likewise, most lay judges would not consider instituer a correct French translation of appoint. In actual translations, however, when the object of the verb is commission, task force, panel, etc., English appoint is usually translated into French as instituer. To account for this kind of context-dependent translational equivalence, link types must be evaluated with respect to the bitext whence they were induced.</Paragraph>
<Paragraph position="2"> I performed a post hoc evaluation of the link types produced by an earlier version of Method B (Melamed 1996b). The bitext used for this evaluation was the same aligned Hansards bitext used by Gale and Church (1991), except that I used only 300,000 aligned segment pairs to save time. The bitext was automatically pretokenized to delimit punctuation, English possessive pronouns, and French elisions. Morphological variants in both halves of the bitext were stemmed to a canonical form.</Paragraph>
<Paragraph position="3"> The link types assigned by the converged model were sorted by the scores in Equation 36. Figure 7 shows the distribution of these scores on a log scale.
[Figure 7: Distribution of link type scores. The long plateaus correspond to the most common combinations of links(u,v)/cooc(u,v): 1/1, 2/2, and 3/3.]
The log scale helps to illustrate the plateaus in the curve. The longest plateau represents the set of word pairs that were linked once out of one co-occurrence (1/1) in the bitext. All these word pairs were equally likely to be correct. The second-longest plateau resulted from word pairs that were linked twice out of two co-occurrences (2/2), and the third-longest plateau from word pairs that were linked three times out of three co-occurrences (3/3). As usual, the entries with higher scores were more likely to be correct. By discarding entries with lower scores, coverage could be traded for accuracy. This trade-off was measured at three points, representing cutoffs at the end of each of the three longest plateaus.</Paragraph>
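The trade-off just described amounts to filtering the induced lexicon by a minimum link-type score. A minimal sketch, with hypothetical data structures and scores (the actual scores come from Equation 36):

def filter_lexicon(scored_entries, min_score):
    # scored_entries: iterable of ((u, v), score) pairs, e.g. the link types
    # assigned by the converged model, sorted by score.
    # Raising min_score discards lower-scoring entries, trading coverage for accuracy.
    return [(pair, score) for pair, score in scored_entries if score >= min_score]

# Hypothetical entries, for illustration only:
entries = [(("immédiatement", "right"), 0.4), (("maison", "house"), 2.7)]
print(filter_lexicon(entries, 1.0))  # keeps only (("maison", "house"), 2.7)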
<Paragraph position="4"> The traditional method of measuring coverage requires knowledge of the correct link types, which is impossible to determine without a gold standard. An approximate coverage measure can be based on the number of different words in the corpus. For lexicons extracted from corpora, perfect coverage implies at least one entry containing each word in the corpus. One-sided variants, which consider only source words, have also been used (Gale and Church 1991).</Paragraph>
[Table 5: Lexicon coverage at three different minimum score thresholds. The bitext contained 41,028 different English words and 36,314 different French words, for a total of 77,342.]
<Paragraph position="5"> Table 5 shows both the marginal (one-sided) and the combined coverage at each of the three cutoff points. It also shows the absolute number of (non-NULL) entries in each of the three lexicons. Of course, the size of automatically induced lexicons depends on the size of the training bitext. Table 5 shows that, given a sufficiently large bitext, the method can automatically construct translation lexicons with as many entries as published bilingual dictionaries.</Paragraph>
<Paragraph position="6"> The next task was to measure accuracy. It would have taken too long to evaluate every lexicon entry manually. Instead, I took five random samples (with replacement) of 100 entries each from each of the three lexicons. Each of the samples was first compared to a translation lexicon extracted from a machine-readable bilingual dictionary (Cousin et al. 1991). All the entries in the sample that appeared in the dictionary were assumed to be correct. I checked the remaining entries in all the samples by hand. To account for context-dependent translational equivalence, I evaluated the accuracy of the translation lexicons in the context of the bitext whence they were extracted, using a simple bilingual concordancer. A lexicon entry (u,v) was considered correct if u and v ever appeared as direct translations of each other in an aligned segment pair. That is, a link type was considered correct if any of its tokens were correct.</Paragraph>
<Paragraph position="7"> Direct translations come in different flavors. Most entries that I checked by hand were of the plain vanilla variety that you might find in a bilingual dictionary (entry type V). However, a significant number of words translated into a different part of speech (entry type P). For instance, in the entry (protection, protégé), the English word is a noun but the French word is an adjective. This entry appeared because to have protection is often translated as être protégé ('to be protected') in the bitext. The entry will never occur in a bilingual dictionary, but users of translation lexicons, be they human or machine, will want to know that translations often happen this way.</Paragraph>
<Paragraph position="8"> The evaluation of translation models at the word type level is complicated by the possibility of phrasal translations, such as immédiatement ↔ right away. All the methods being evaluated here produce models of translational equivalence between individual words only. How can we decide whether a single-word translation &quot;matches&quot; a phrasal translation? The answer lies in the observation that corpus-based lexicography usually involves a lexicographer. Bilingual lexicographers can work with bilingual concordancing software that can point them to instances of any link type induced from a bitext and display these instances sorted by their contexts (e.g. Simard, Foster, and Perrault 1993).
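A toy sketch of such a concordance lookup (hypothetical data and names, not the software cited above) simply retrieves the aligned segment pairs in which both halves of a proposed link type occur:

def concordance(bitext, u, v):
    # bitext: list of (english_tokens, french_tokens) aligned segment pairs.
    # Return every pair in which u occurs on the English side and v on the
    # French side, so the contexts of the proposed link type (u, v) can be inspected.
    return [(e, f) for e, f in bitext if u in e and v in f]

# Toy bitext, for illustration only:
bitext = [(["come", "here", "right", "away"], ["venez", "ici", "immédiatement"])]
for e, f in concordance(bitext, "right", "immédiatement"):
    print(" ".join(e), "||", " ".join(f))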
Given an incomplete link type, the lexicographer can usually reconstruct the complete link type from the contexts in the concordance. For example, if the model proposes an equivalence between immédiatement and right, a bilingual concordance can show the lexicographer that the model was really trying to capture the equivalence between immédiatement and right away or between immédiatement and right now. I counted incomplete entries in a third category (entry type I). Whether links in this category should be considered correct depends on the application.</Paragraph>
<Paragraph position="9"> Table 6 shows the distribution of correct lexicon entries among the types V, P, and I. Figure 8 graphs the accuracy of the method against coverage, with 95% confidence intervals. The upper curve represents accuracy when incomplete links are considered correct, and the lower curve when they are considered incorrect. On the former metric, the method can generate translation lexicons with accuracy and coverage both exceeding 90%, as well as dictionary-size translation lexicons that are over 99% correct.</Paragraph> </Section> </Section> </Paper>