<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1013"> <Title>From Words to Corpora: Recognizing Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Quantifying Similarity </SectionTitle> <Paragraph position="0"> This section shows how to compute a cross-lingual similarity score, tsim, for two texts.[1] Suppose parallel texts are generated according to Melamed's (2000) symmetric word-to-word model (Model A). Let a link be a pair (x, y) where x is a word in language L1 and y is a word in L2. Within a link, one of the words may be NULL, but not both. The model consists of a bilingual dictionary that gives a probability distribution over all possible link types. In the generative process, a sequence of independent link tokens is generated according to that distribution.</Paragraph> <Paragraph position="1"> The links are not observed; only the lexical (non-NULL) words in each language are observed. The texts whose similarity score is to be computed, X and Y, correspond to the monolingual lexical projections of the links. For the purposes of this discussion, the texts are viewed as unordered bags of words; scrambling of the link tokens in the two texts is not modeled.</Paragraph> <Paragraph position="2"> An example is illustrated in Figure 1; there are seven link tokens shown, five of which are lexical in X (the English side) and six of which are lexical in Y (the French side).</Paragraph> <Paragraph position="3"> The next step is to compute the probability of the most probable sequence that could have accounted for the two texts. All permutations of a given link sequence will have the same probability (since the links are generated independently), so the order of the sequence is not important. 
As noted by Melamed (2000), under the assumption that the quality of a link collection is the sum of the quality of its links, the problem of finding the best set of links is equivalent to the maximum-weighted bipartite matching (MWBM) problem: Given a weighted bipartite graph $G = (V_1 \cup V_2, E)$ with edge weights $c_{i,j}$, find a matching $M \subseteq E$ such that each vertex has at most one edge in $M$, and $\sum_{e \in M} c_{i,j}$ is maximized. The fastest known MWBM algorithm runs in $O(ve + v^2 \log v)$ time (Ahuja et al., 1993). Applied to this problem, that is $O(\max(|X|, |Y|)^3)$.</Paragraph> <Paragraph position="4"> The similarity score should be high when many of the link tokens in the best link collection do not involve NULL tokens. Further, it should normalize for text length. Specifically, the score I use is:</Paragraph> <Paragraph position="6"> This score is an example of Lin's (1998) mathematical definition of similarity, which is motivated by information theory:</Paragraph> <Paragraph position="8"> where X and Y are any objects generated by a probabilistic model.[2] In this research, I seek to show how multiple linguistic resources can be exploited together to recognize translation. The measure in (1) is simplified by assuming that all links in a given translation lexicon are equiprobable. (In some cases I use an automatically induced translation lexicon that assigns probabilities to the entries, but for generality the probabilities are ignored.) This reduces the formula in (1) to $$\mathrm{tsim} = \frac{\#\,\text{two-word links in best matching}}{\#\,\text{links in best matching}} \qquad (3)$$ Further, to compute tsim under the equiprobability assumption, we need not compute the MWBM, but only find the maximum cardinality bipartite matching (MCBM), since all potential links have the same weight. [Footnote 2: Another approach, due to Jason Eisner (personal communication), would be to use a log-likelihood ratio of two hypotheses, joint vs. separate generation of the two texts: $\log \frac{\Pr(\text{all links in the best sequence})}{\Pr(\text{all words in } X)\,\Pr(\text{all words in } Y)}$. In order to make this value (which is the Viterbi approximation to point-wise mutual information between the two texts) a score suitable for comparison between different pairs of texts, it must be normalized by length. With normalization, this score is monotonic in Lin's (1998) sim if a uniform unigram model is assumed for the tokens in the single-language models (the denominator terms).]</Paragraph> <Paragraph position="9"> An $O(e\sqrt{v})$ (or $O(|X|\,|Y|\,\sqrt{|X|+|Y|})$ for this purpose) algorithm exists for MCBM (Ahuja et al., 1993). If the matching shown in Figure 1 is the MCBM (for some translation lexicon), then $\mathrm{tsim}(X, Y) = \frac{4}{7}$ under the simplifying assumption. If Equation (3) is applied to pairs of documents in the same language, with a &quot;translation lexicon&quot; defined by the identity relation, then tsim is a variant of resemblance (r), as defined by Broder et al. (1997) for the problem of monolingual duplicate detection: $r(X, Y) = \frac{|S(X) \cap S(Y)|}{|S(X) \cup S(Y)|}$,</Paragraph> <Paragraph position="11"> where S(Z) is a shingling of the words in Z; a shingling is the set of unique n-gram types in the text for some fixed n (Damashek, 1995). Unlike Broder et al.'s r, however, tsim is token-based, incorporating word frequency. Specifically, the intersection of two bags (rather than sets) of tokens contains the minimum count (over the intersected bags) of each type; the union contains the maximum counts, e.g.,</Paragraph> <Paragraph position="13"> With the assumption of equiprobability, any translation lexicon (or, importantly, union thereof) containing a set of word-to-word entries can be used in computing tsim.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Finding Translations </SectionTitle> <Paragraph position="0"> Formally, the evaluation task I propose can be described as follows: Extract all translation pairs from a pool of 2n texts, where n of them are known to be in language L1 and the other n are known to be in L2. 
Each text can have one or zero translations in the corpus; let the number of true translation pairs be k.</Paragraph> <Paragraph position="1"> The general technique for completing the task is to first find the best matching of words in text pairs (posed as a bipartite matching problem) in order to compute the tsim similarity score. Next, to extract translation pairs of texts from a corpus, find the best matching of texts based on their pairwise tsim scores, which can be posed as a &quot;higher-level&quot; MWBM problem: by matching the texts using their pairwise similarity scores, a corpus of pairs of highly similar texts is extracted from the pool.</Paragraph> <Paragraph position="2"> If k is known, then the text-matching problem is a generalization of MWBM: Given a weighted bipartite graph $G = (V_1 \cup V_2, E)$ with $|V_1| = |V_2|$ and edge weights $c_{i,j}$, find a matching $M \subseteq E$ of size k such that each vertex has at most one edge in $M$, and $\sum_{e \in M} c_{i,j}$ is maximized. The set of texts in L1 is $V_1$, and the set of texts in L2 is $V_2$; the weights $c_{i,j}$ are the scores $\mathrm{tsim}(v_i, v_j)$. I do not seek a solution to the generalized problem here; one way of approximating it is by taking the top k tsim-scored elements from the set M (the MWBM).</Paragraph> <Paragraph position="3"> If k is not known, it can be estimated (via sampling and human evaluation); I take the approach of varying the estimate of k by applying a threshold $\tau$ on the tsim scores, then computing precision and recall for those pairs in M whose score is above $\tau$ (call this set $M_\tau$): $\mathrm{precision} = \frac{|M_\tau \cap T|}{|M_\tau|}$, $\mathrm{recall} = \frac{|M_\tau \cap T|}{|T|}$,</Paragraph> <Paragraph position="5"> where T is the set of k true translation pairs.</Paragraph> <Paragraph position="6"> Performance results are presented as (precision, recall) pairs as $\tau$ is lowered.[3] Melamed (2000) used a greedy approximation to MWBM called competitive linking, which iteratively selects the edge with the highest weight, links those two vertices, then removes them from the graph. (Ties are broken at random.) 
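As a minimal sketch of the greedy procedure just described (not Melamed's implementation), competitive linking can be written with a heap in a few lines of Python. The function name `competitive_linking`, the dictionary-of-edge-weights input, and the deterministic tie-breaking by vertex order are assumptions made for illustration; the description above breaks ties at random.

```python
import heapq


def competitive_linking(weights):
    """Greedy approximation to maximum-weighted bipartite matching.

    weights: dict mapping (i, j) vertex pairs to edge weights (hypothetical
    input format chosen for this sketch).
    Returns a list of (i, j, weight) links in the order they were selected.
    """
    # heapq is a min-heap, so weights are negated to pop the heaviest edge first.
    # Ties fall back to comparing the vertex labels (an assumption of this sketch).
    heap = [(-w, i, j) for (i, j), w in weights.items()]
    heapq.heapify(heap)

    used_left, used_right, links = set(), set(), []
    while heap:
        neg_w, i, j = heapq.heappop(heap)
        # Skip edges touching a vertex that an earlier link already removed.
        if i in used_left or j in used_right:
            continue
        links.append((i, j, -neg_w))
        used_left.add(i)
        used_right.add(j)
    return links


if __name__ == "__main__":
    toy = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.7, ("b", "y"): 0.1}
    print(competitive_linking(toy))
```

On this toy graph the greedy result links a with x and b with y, for a total weight of 1.0, whereas the optimal matching (a with y, b with x) has total weight 1.5; this illustrates why competitive linking is only an approximation to MWBM.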
A heap-based implementation of competitive linking runs in $O(\max(|X|, |Y|) \log \max(|X|, |Y|))$. In the first experiment, I show a performance comparison between MWBM and competitive linking.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiment: English-Chinese </SectionTitle> <Paragraph position="0"> This experiment used the Hong Kong Hansard English-Chinese parallel corpus. The training corpus is aligned at the sentence level, with segment lengths averaging fifteen words (in each language). The test corpus is aligned at the two-sentence level, with segment lengths averaging thirty words. The first experiment involved ten-fold cross-validation with (for each fold) a training corpus of 9,400 sentence pairs and a test corpus of 1,000 two-sentence pairs. [Footnote 3: The selection of an appropriate $\tau$ will depend on the application, the corpus, the lexicons, etc. In my evaluation on WWW data, I use a small development set to choose a threshold that maximizes one measure of performance.]</Paragraph> <Paragraph position="1"> The corpus was randomly divided into folds, and no noise was introduced (i.e., k = n).[4]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Translation Lexicon </SectionTitle> <Paragraph position="0"> The main translation lexicon of interest is a union of three word-to-word translation lexicons from different sources. I refer to this translation lexicon as UTL.</Paragraph> <Paragraph position="1"> The first component translation lexicon, DICT, was made from the union of two English-Chinese electronic dictionaries, specifically, those from Meng et al. (2000) and Levow et al. (2000) (a total of 735,908 entries, many of which are not one-to-one). 
To restrict the dictionary to one-to-one entries, each n-to-m entry was processed by removing all function words on either side of the entry (according to a language-specific stoplist), then, if both sides have one or two words (no more), adding all word pairs in the cross-product (otherwise the entry is discarded).[5] The resulting translation lexicon contains 577,655 word pairs, 48,193 of which contain two words that are present in the corpus. This translation lexicon has the advantage of broad coverage, though it does not generally contain names or domain-specific words, which are likely to be informative, and does not capture morphological variants.</Paragraph> <Paragraph position="2"> The second translation lexicon, TMTL, is automatically generated by training a symmetric word-to-word translation model (Model A; Melamed, 2000) on the training corpus.[6] All word pairs with nonzero probability were added to the translation lexicon (no smoothing or thresholding was applied). On average (over ten folds), this translation lexicon contained 6,282 entries. The TMTL translation lexicons are expected to capture words specific to the domain [...] [Footnote 6: [...] competitive linking), which is the maximum posterior approximation to EM. It is not clear, however, that this change yields performance gains.]</Paragraph> <Paragraph position="3"> [...] also contain noise.</Paragraph> <Paragraph position="4"> The third translation lexicon, STR, is the string identity lexicon: (x, y) is in the translation lexicon iff the string x is identical to the string y. This translation lexicon captures punctuation, numerals, alphanumeric strings used to label sections, and English words included as-is in the Chinese corpus. 
There were 3,083 such pairs of word types in the corpus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Filtering </SectionTitle> <Paragraph position="0"> Chen and Nie (2000) note that text pairs that are highly disparate in length are unlikely to be translations. In order to avoid computing tsim scores for all pairs in the cross-product, I eliminated all segment pairs whose lengths are outliers in a linear regression model estimated from the training corpus. Earlier experiments (on a different corpus) showed that, if a $(1 - p)$ confidence interval is used, the size of the search space reduces exponentially as p increases, while the number of correct translation pairs that do not pass the filter is only linear in p (i.e., the filter gives high recall and high precision). For these experiments, p = 0.05; this value was selected based on the results presented in Smith (2001).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> When the length filter was applied to the 1,000,000 possible pairs in the cross-product, 47.9% of the pairs were eliminated, while 94.5% of the correct pairs were kept, on average (over ten folds). tsim was computed for each pair that passed the filter, then each matching algorithm (MWBM and competitive linking) was applied. As discussed above, a threshold can then be applied to the matching to select the pairs about whose translational equivalence the score is most confident. Precision and recall plots are shown in Figure 2a. 
Each line corresponds to a (translation lexicon, matching algorithm) pair, showing average precision and recall over the ten folds as the threshold varies.</Paragraph> <Paragraph position="1"> The plots should be read from left to right, with recall increasing as the threshold is lowered.</Paragraph> <Paragraph position="2"> When many resources are used, the technique is highly adept at selecting the translation pairs.</Paragraph> <Paragraph position="3"> TMTL alone outperforms DICT alone, probably due to its coverage of domain-specific terms. The competitive linking algorithm lags behind MWBM in most cases, though its performance was slightly better in the case of TMTL. In the case of UTL, for recall up to 0.8251, the thresholded MWBM matching had significantly higher precision than the thresholded competitive linking matching at a comparable level of recall (based on a Sign Test over the ten cross-validation folds, p &lt; 0.01).</Paragraph> <Paragraph position="4"> Table 1 shows the maximum performance (by F-score) for each translation lexicon under MWBM and competitive linking.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Effects of Noise </SectionTitle> <Paragraph position="0"> Next, I performed an experiment to test the technique's robustness to noise. In this case, the test corpus contained 300 known translation pairs (again, two-sentence texts). From 0 to 2,700 additional English texts and the same number of Chinese texts were added. These &quot;noise&quot; texts were from the same corpus and were guaranteed not to be aligned with each other.[7] The length filter eliminated 48.6% of the 9,000,000 possible pairs in the cross-product, keeping 95.7% of the true pairs. The filtered pairs were tsim-scored using UTL, then the MWBM was computed. 
Precision and recall are plotted for various levels of noise in Figure 2b.[8] Only in the highest-noise condition ($k/n = 0.1$) do we observe a situation where a sufficiently strict threshold cannot be used to guarantee an extracted corpus of (nearly) arbitrarily high precision. For example, if 90% precision is required, 88.3%, 60.3%, and 43.7% recall can be guaranteed when $k/n$ is 1, 0.5, and 0.25, respectively.</Paragraph> <Paragraph position="1"> These experiments show that with a strict threshold this technique is capable of producing a highly precise matching of parallel text from a noisy corpus, though attainable recall levels drop as noise is added. Performance can be boosted by incorporating additional bilingual resources.</Paragraph> <Paragraph position="2"> [Footnote 7: In general, robustness to noise will depend on the source of the noise and how much the noise looks like the true translations. Hence the results presented here may be better or worse than those achieved in specific applications to which this technique might be applied, depending on those factors, filtering, etc.]</Paragraph> <Paragraph position="3"> [Footnote 8: Experiments were carried out for the TMTL and DICT translation lexicons, and also under competitive linking. Space does not permit a full discussion, though it is worth mentioning that, as in the noiseless experiment, UTL outperformed the others; likewise, MWBM outperformed competitive linking.]</Paragraph> <Paragraph position="4"> [Table 1. Columns: Tr. lex., Algorithm, prec, rec, F. Caption: [...] matching algorithms at their maximal F-scores. Note that thresholds, and tsim scores in general, are comparable only for a given translation lexicon. The STR translation lexicon offered a boost only when used to supplement TMTL $\cup$ DICT; when added to each alone it had little or no effect.]</Paragraph> <Paragraph position="5"> [Table 2. Columns: n; top 300 pairs (prec, rec); maximum F (prec, rec, F). Caption: [...] pairs are taken (i.e., k is known; in the case of n = 300, the matching contained only 293 pairs), and at maximal F-scores for various levels of noise.]</Paragraph> <Paragraph position="6"> Finally, even a fast, greedy approximation to the best matching can be useful.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiment: English-French </SectionTitle> <Paragraph position="0"> An important application of translation recognition is the construction of parallel text corpora. One source of raw text in this task is the World-Wide Web, for which several parallel text search systems currently exist (Resnik, 1999; Nie et al., 1999; Ma and Liberman, 1999).</Paragraph> <Paragraph position="1"> These systems propose candidate pairs of pages, which are then classified as either translationally equivalent or not. The STRAND system (Resnik, 1999), for example, uses structural markup information from the pages, without looking at their content, to attempt to align them.</Paragraph> <Paragraph position="2"> If the tsim technique can provide a classifier that rivals or complements the structural one, using as it does an entirely orthogonal set of features, then perhaps a combined classifier could provide even greater reliability. In addition, custom-quality parallel corpora could be generated from comparable corpora that lack structural features. This experiment also shows that tsim is scalable to larger texts. [Figure 2: (a) [...] Shown are curves for each of UTL, TMTL, and DICT under both algorithms (MWBM, CL); the maximum F scores are marked (see Table 1). (b) Precision-recall curves at varying levels of noise. k = 300 in all cases; the circles and dashed line show precision and recall for the top 300 pairs in the matching (i.e., if k were known, it would not make sense to use a lower threshold, so the only reasonable thresholds are to the left), and the squares and dotted line show precision and recall at each condition's maximum F-score; the values are shown in Table 2. (Note that the curves &quot;stop&quot; before reaching a point where recall is 1.0, since a point is eventually reached where no more matches are possible because of filtering.)]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Translation Lexicon </SectionTitle> <Paragraph position="0"> In this experiment, the language pair is English-French. Multiple sources for the translation lexicon are used in a manner similar to Section 4.1.</Paragraph> <Paragraph position="1"> An English-French dictionary (a total of 34,808 entries, 4,021 of which are not one-to-one).[9] It contains morphological variants but does not include character accents. Each n-to-m entry was processed by stoplisting and then extracting all word pairs in the remaining cross-product as in Section 4.1. Result: 39,348 word pairs, 9,045 of which contain two words present in the corpora.</Paragraph> <Paragraph position="2"> A word-to-word translation model (Melamed, 2000) trained on a verse-aligned Bible using MWBM (15,548 verses, averaging 25.5 English words, 23.4 French words after tokenization).</Paragraph> <Paragraph position="3"> Result: 13,762 word pairs.</Paragraph> <Paragraph position="4"> English-French cognate pairs, identified using the method of Tiedemann (1999). 
Space does not permit a full description of the technique; I simply note that cognates were identified by thresholding on a specially trained string-similarity score based on language-specific character-to-character weights.[10] Result: 35,513 word pairs. An additional set of 11,264 exact string matches was added. These entries are quite noisy.</Paragraph> <Paragraph position="5"> The union of these translation lexicons consists of 68,003 unique word pairs. The experiment used only this union translation lexicon.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> In order to compare tsim with structural similarity scoring, I applied it to 325 English-French web-document pairs. These were the same pairs for which human evaluations were carried out by Resnik (1999).[11] Note that this is not a matching task; the documents are presented as candidate pairs, and there is no competition among pages for matches in the other language. [Footnote 10: Tiedemann trained these weights using a list of known cognates; I use a noisy list of weighted translation pairs (specifically, TMTL). Hence the resources required to extract cognates in this way are no different from those required for the translation model.]</Paragraph> <Paragraph position="1"> [Footnote 11: One additional pair was thrown out because it contained compressed data; it is assumed that pair would not pass a language identification filter.]</Paragraph> <Paragraph position="2"> At different thresholds, a $\kappa$ score of agreement (with each of Resnik's (1999) two judges and their intersection) may be computed for comparison with Resnik's STRAND system, along with recall and precision against a gold standard (for which I use the intersection of the judges, i.e., the set of examples where the judges agreed). 
Note that recall in this experiment is relative to the candidate set proposed by the STRAND search module, not the WWW or even the set of pages encountered in the search.</Paragraph> <Paragraph position="3"> The estimate of tsim (MWBM on the words in the document pair) is not computationally feasible for very large documents and translation lexicons. In preliminary comparisons, I found that representing long documents by as few as their first 500 words results in excellent performance on the $\kappa$ measure. This allows O(1) estimation of tsim for two documents: look only at the first (fixed) n words of each document. Further, the competitive linking algorithm appears to be as reliable as MWBM.</Paragraph> <Paragraph position="4"> The results reported here approximated tsim by using competitive linking on the first 500 words.</Paragraph> <Paragraph position="5"> Of the 325 pairs, 32 were randomly selected as a development set. Maximizing $\kappa$ on this set yielded a value of $\tau = 0.15$.[12] $\kappa$ scores against each judge and their intersection were then computed at that threshold on the test set (the remaining 293 pairs). These are compared to $\kappa$ scores of the STRAND system, on the same test set, in Table 3. In every case, the tsim classifier agreed more strongly with the human evaluations. 
At $\tau = 0.15$, precision was 0.680 and recall was 0.921, F = 0.782 (on the same set, STRAND structural classification achieved 0.963 precision and 0.684 recall, F = 0.800).</Paragraph> <Paragraph position="6"> Figure 3 shows $\kappa$, precision, and recall plotted against $\tau$.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Future Directions </SectionTitle> <Paragraph position="0"> The success of this approach suggests a way to construct parallel corpora from any large, segmented comparable corpus: start with a translation model estimated on a small, high-quality parallel text, and a core dictionary; then extract document pairs with high similarity (tsim) and add them to the parallel corpus. Next, estimate word-level translational equivalence empirically from the enlarged corpus and update the translation lexicon; extract documents iteratively.</Paragraph> <Paragraph position="1"> [Footnote 12: One could select such a threshold to maximize any objective function over the development set.] [Table 3 caption: [...] 294 of the 326 pairs in Resnik's (1999) test set. The STRAND $\kappa$ scores are similar to those published by Resnik (1999). The 32 development pairs were used to select the 0.15 threshold. N is the number of examples for which judgement-comparison was possible in each case (human judges were sometimes undecided; those cases are ignored in computing $\kappa$).] [Figure 3 caption: [...] threshold $\tau$: the $\kappa$ agreement score with the two judges' intersection, precision, and recall. All measures are on the test set. The $\kappa$ score obtained by STRAND is shown as well.]</Paragraph> <Paragraph position="2"> The experiments presented here show that, even in highly noisy search spaces, tsim can be used with a threshold to extract a high-precision parallel corpus at moderate recall. It is worth noting that the STRAND classifier and the tsim classifier disagreed 15% of the time on the test set. 
A simple combination by disjunction (i.e., &quot;(X, Y) is a translation pair if either classifier says so&quot;) yields precision 0.768, recall 0.961, F = 0.854, and $\kappa$ (with the judges' intersection) at 0.878. In future work, more sophisticated combinations of the two classifiers might integrate the advantages of both.</Paragraph> </Section> </Paper>