<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4003"> <Title>Improving Machine Translation Performance</Title> <Section position="3" start_page="479" end_page="488" type="metho"> <SectionTitle> 2. A System for Extracting Parallel Sentences from Comparable Corpora </SectionTitle> <Paragraph position="0"> The general architecture of our extraction system is presented in Figure 2. Starting with two large monolingual corpora (a non-parallel corpus) divided into documents, we begin by selecting pairs of similar documents (Section 2.1). From each such pair, we generate all possible sentence pairs and pass them through a simple word-overlap-based filter (Section 2.2), thus obtaining candidate sentence pairs. The candidates are presented to a maximum entropy (ME) classifier (Section 2.3) that decides whether the sentences in each pair are mutual translations of each other.</Paragraph> <Paragraph position="1"> The resources required by the system are minimal: a bilingual dictionary and a small amount of parallel data (used for training the ME classifier). The dictionaries used in our experiments are learned automatically from (out-of-domain) parallel corpora; thus, the only resource used by our system consists of parallel sentences.</Paragraph> <Paragraph position="2"> 2 If such a resource is unavailable, other dictionaries can be used.</Paragraph> <Section position="1" start_page="480" end_page="480" type="sub_section"> <SectionTitle> Munteanu and Marcu Exploiting Non-Parallel Corpora 2.1 Article Selection </SectionTitle> <Paragraph position="0"> Our comparable corpus consists of two large, non-parallel, news corpora, one in English and the other in the foreign language of interest (in our case, Chinese or Arabic). The parallel sentence extraction process begins by selecting, for each foreign article, English articles that are likely to contain sentences that are parallel to those in the foreign one.</Paragraph> <Paragraph position="1"> This step of the process emphasizes recall rather than precision. For each foreign document, we do not attempt to find the best-matching English document, but rather a set of similar English documents. The subsequent components of the system are robust enough to filter out the extra noise introduced by the selection of additional (possibly bad) English documents.</Paragraph> <Paragraph position="2"> We perform document selection using the Lemur IR toolkit (Ogilvie and Callan 2001). We first index all the English documents into a database. For each foreign document, we take the top five translations of each of its words (according to our probabilistic dictionary) and create an English language query. The translation probabilities are only used to choose the word translations; they do not appear in the query. We use the query to run TF-IDF retrieval against the database, take the top 20 English documents returned by Lemur, and pair each of them with the foreign query document.</Paragraph> <Paragraph position="3"> This document matching procedure is both slow (it looks at all possible document pairs, so it is quadratic in the number of documents) and imprecise (due to noise in the dictionary, the query will contain many wrong words). We attempt to fix these problems by using the following heuristic: we consider it likely that articles with similar content have publication dates that are close to each other. 
<Paragraph position="4"> Our experiments have shown that the final performance of the system does not depend much on the size of the window (for example, doubling it to 10 days made no difference). However, having no window at all leads to a decrease in the overall performance of the system.</Paragraph> </Section>

<Section position="2" start_page="480" end_page="481" type="sub_section"> <SectionTitle> 2.2 Candidate Sentence Pair Selection </SectionTitle>

<Paragraph position="0"> From each foreign document and its set of associated English documents, we take all possible sentence pairs and pass them through a word-overlap filter.</Paragraph>

<Paragraph position="1"> The filter verifies that the ratio of the lengths of the two sentences is no greater than two. It then checks that at least half the words in each sentence have a translation in the other sentence, according to the dictionary. Pairs that do not fulfill these two conditions are discarded; the others are passed on to the parallel sentence selection stage.</Paragraph>

<Paragraph position="2"> This step removes most of the noise (i.e., pairs of non-parallel sentences) introduced by our recall-oriented document selection procedure. It also removes some good pairs that fail to pass the filter because the dictionary does not contain the necessary entries; but those pairs could not have been handled reliably anyway, so the overall effect of the filter is to improve the precision and robustness of the system. However, the filter also accepts many wrong pairs, because the word-overlap condition is weak: stopwords almost always have a translation on the other side, so if a few of the content words happen to match, the overlap threshold is fulfilled and an erroneous candidate sentence pair is selected.</Paragraph> </Section>
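The word-overlap filter is simple enough to state directly in code. A minimal sketch, assuming the dictionary maps each source word to the set of its known target-language translations:

```python
def passes_filter(src_words, tgt_words, dictionary, max_ratio=2.0, min_overlap=0.5):
    """Candidate filter of Section 2.2: the length ratio must be at most 2,
    and at least half the words on each side must have a translation on the
    other side, according to the dictionary."""
    ls, lt = len(src_words), len(tgt_words)
    if ls == 0 or lt == 0 or max(ls, lt) > max_ratio * min(ls, lt):
        return False
    tgt_set, src_set = set(tgt_words), set(src_words)
    src_covered = sum(1 for w in src_words if dictionary.get(w, set()) & tgt_set)
    tgt_covered = sum(1 for w in tgt_words
                      if any(w in dictionary.get(s, set()) for s in src_set))
    return src_covered >= min_overlap * ls and tgt_covered >= min_overlap * lt
```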
<Section position="3" start_page="481" end_page="481" type="sub_section"> <SectionTitle> 2.3 Parallel Sentence Selection </SectionTitle>

<Paragraph position="0"> For each candidate sentence pair, we need a reliable way of deciding whether the two sentences in the pair are mutual translations. This is achieved by a maximum entropy (ME) classifier (described at length in Section 3), which is the core component of our system. The pairs that are classified as translations of each other constitute the output of the system.</Paragraph>

3. A Maximum Entropy Classifier for Parallel Sentence Identification

<Paragraph position="1"> In the maximum entropy (ME) statistical modeling framework, we impose constraints on the model of our data by defining a set of feature functions. These feature functions emphasize properties of the data that we believe to be useful for the modeling task. For example, for a sentence pair sp, the word overlap (the percentage of words in either sentence that have a translation in the other) might be a useful indicator of whether the sentences are parallel. We therefore define a feature function f(sp) whose value is the word overlap of the sentences in sp.</Paragraph>

<Paragraph position="2"> According to the ME principle, the optimal parametric form of the model of our data, taking into account the constraints imposed by the feature functions, is a log-linear combination of these functions. Thus, for our classification problem, we have

\[ P(c \mid sp) = \frac{1}{Z(sp)} \exp\Big( \sum_{j=1}^{k} \lambda_j \, f_{cj}(sp) \Big), \]

where \(c \in \{\textit{parallel}, \textit{non-parallel}\}\), \(Z(sp)\) is a normalization factor, and the \(f_{cj}\) are the feature functions (indexed both by class and by feature). The resulting model has free parameters \(\lambda_j\), the feature weights. The parameter values that maximize the likelihood of a given training corpus can be computed using various optimization algorithms (see Malouf [2002] for a comparison of such algorithms).</Paragraph> </Section>
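For concreteness, the classification decision implied by this model can be sketched as follows; the feature names and weight values are illustrative placeholders, not the trained parameters:

```python
import math

def p_parallel(features, weights):
    """Two-class log-linear model: P(c|sp) = exp(sum_j lambda_j f_cj(sp)) / Z(sp).
    `features` maps feature names to values f_j(sp); `weights` maps each
    class to that class's weight vector (the class-indexed f_cj)."""
    scores = {c: sum(w.get(name, 0.0) * val for name, val in features.items())
              for c, w in weights.items()}
    z = sum(math.exp(s) for s in scores.values())  # Z(sp), summed over classes
    return math.exp(scores["parallel"]) / z

# Illustrative usage with made-up weights:
weights = {"parallel": {"word_overlap": 2.0, "length_ratio": -0.5},
           "non-parallel": {"word_overlap": -2.0, "length_ratio": 0.5}}
print(p_parallel({"word_overlap": 0.8, "length_ratio": 1.1}, weights))
```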
<Section position="4" start_page="481" end_page="483" type="sub_section"> <SectionTitle> 3.1 Features for Parallel Sentence Identification </SectionTitle>

<Paragraph position="0"> For our particular classification problem, we need feature functions that distinguish between parallel and non-parallel sentence pairs. For this purpose, we compute and exploit word-level alignments between the sentences in each pair. A word alignment between two sentences in different languages specifies which words in one sentence are translations of which words in the other. Word alignments were first introduced in the context of statistical MT, where they are used to estimate the parameters of a translation model (Brown et al. 1990). Since then, they have been found useful in many other NLP applications (e.g., word sense tagging [Diab and Resnik 2002] and question answering [Echihabi and Marcu 2003]).</Paragraph>

<Paragraph position="1"> Figures 3 and 4 give examples of word alignments between two English-Arabic sentence pairs from our comparable corpus. Each figure contains two alignments: the one on the left is a correct alignment, produced by a human, while the one on the right was computed automatically.</Paragraph>

(Figure 3: Alignments between two parallel sentences.)
(Figure 4: Alignments between two non-parallel sentences.)

<Paragraph position="2"> As can be seen from the glosses next to the Arabic words, the sentences in Figure 3 are parallel while the sentences in Figure 4 are not. In a correct alignment between two non-parallel sentences, most words would have no translation equivalents; in contrast, in an alignment between parallel sentences, most words would be aligned. Automatically computed alignments, however, may have incorrect connections; for example, on the right side of Figure 3, the Arabic word glossed "issue" is connected to the comma, and in Figure 4, the Arabic word glossed "at" is connected to the English phrase "its case to the". Such errors are due to noisy dictionary entries and to shortcomings of the model used to generate the alignments. Thus, merely looking at the number of unconnected words, while helpful, is not discriminative enough. Still, automatically produced alignments have certain additional characteristics that can be exploited.</Paragraph>

<Paragraph position="3"> We follow Brown et al. (1993) in defining the fertility of a word in an alignment as the number of words it is connected to. The presence, in an automatically computed alignment between a pair of sentences, of words of high fertility (such as the Arabic word glossed "at" in Figure 4) is indicative of non-parallelism. Most likely, these connections were produced because of a lack of better alternatives.</Paragraph>

<Paragraph position="4"> Another aspect of interest is the presence of long contiguous connected spans, which we define as pairs of bilingual substrings in which the words in one substring are connected only to words in the other substring. Such a span may contain a few words without any connection (a small percentage of the length of the span), but no word with a connection outside the span. Examples of such spans can be seen in Figure 3: the English strings "after saudi mediation failed" and "to the international court of justice", together with their Arabic counterparts. Long contiguous connected spans are indicative of parallelism, since they suggest that the two sentences have long phrases in common. In contrast, long substrings whose words are all unconnected are indicative of non-parallelism.</Paragraph>

<Paragraph position="5"> To summarize, our classifier uses the following features, defined over two sentences and an automatically computed alignment between them.</Paragraph>

<Paragraph position="6"> General features (independent of the word alignment):
• lengths of the sentences, as well as the length difference and length ratio;
• percentage of words on each side that have a translation on the other side (according to the dictionary).</Paragraph>

<Paragraph position="7"> Alignment features:
• percentage and number of words that have no connection;
• the top three largest fertilities;
• length of the longest contiguous connected span; and
• length of the longest unconnected substring.</Paragraph> </Section>

<Section position="5" start_page="483" end_page="484" type="sub_section"> <SectionTitle> 3.2 Word Alignment Model </SectionTitle>

<Paragraph position="0"> In order to compute word alignments, we need a simple and efficient model. We want to align a large number of sentences, with many out-of-vocabulary words, in reasonable time. We also want a model with as few parameters as possible--preferably only word-for-word translation probabilities.</Paragraph>

<Paragraph position="1"> One such model is IBM Model 1 (Brown et al. 1993). According to this model, given a foreign sentence \(f_1 \ldots f_m\), an English sentence \(e_1 \ldots e_l\), and translation probabilities \(t(f_j \mid e_i)\), the most likely alignment connects each foreign word \(f_j\) to the English word \(e_{a_j}\), where

\[ a_j = \arg\max_{0 \le i \le l} \; t(f_j \mid e_i). \]

Thus, each foreign word is aligned to exactly one English word (or to a special NULL token, \(e_0\)). Due to its simplicity, this model has several shortcomings, some more structural than others (see Moore [2004] for a discussion). We therefore use a version that is augmented with two simple heuristics that attempt to alleviate some of these shortcomings.</Paragraph>

<Paragraph position="2"> One possible improvement concerns English words that appear more than once in a sentence. According to the model, a foreign word that prefers to be aligned with such an English word could be aligned equally well with any instance of that word. In such situations, instead of arbitrarily choosing the first instance or a random instance, we attempt to make a "smarter" decision: first, we create links only for those English words that appear exactly once; next, for words that appear more than once, we choose which instance to link to so as to minimize the number of crossings with already existing links.</Paragraph>

<Paragraph position="3"> The second heuristic attempts to improve the choice of the most likely English translation of a foreign word. Our translation probabilities are automatically learned from parallel data, and we learn values for both \(t(f_j \mid e_i)\) and \(t(e_i \mid f_j)\). We can therefore choose as the most likely English translation of \(f_j\) the English word \(e_i\) that maximizes the product \(t(f_j \mid e_i) \cdot t(e_i \mid f_j)\); using both sets of probabilities is likely to help us make a better-informed decision.</Paragraph>
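A compact sketch of this alignment strategy and of the alignment features of Section 3.1. The product of the two directional probabilities implements the second heuristic; the crossing-minimization heuristic is omitted for brevity, and the "longest contiguous connected span" is approximated here by the longest run of connected foreign words (the bilingual-substring definition above is stricter):

```python
from collections import Counter

def align(foreign, english, t_fe, t_ef, floor=0.0):
    """Link each foreign word to the English word maximizing
    t(f|e) * t(e|f), or to NULL (None) if no score beats `floor`."""
    links = []
    for j, f in enumerate(foreign):
        scores = [(t_fe.get((f, e), 0.0) * t_ef.get((e, f), 0.0), i)
                  for i, e in enumerate(english)]
        best_score, best_i = max(scores) if scores else (0.0, None)
        links.append((j, best_i if best_score > floor else None))
    return links

def alignment_features(links, len_f):
    """Alignment features of Section 3.1, computed from the foreign side."""
    connected = {j for j, i in links if i is not None}
    fertility = Counter(i for _, i in links if i is not None)
    top_fertilities = sorted(fertility.values(), reverse=True)[:3]
    longest_connected = longest_unconnected = run_c = run_u = 0
    for j in range(len(links)):
        run_c, run_u = (run_c + 1, 0) if j in connected else (0, run_u + 1)
        longest_connected = max(longest_connected, run_c)
        longest_unconnected = max(longest_unconnected, run_u)
    return {"unconnected_count": len_f - len(connected),
            "unconnected_pct": 1.0 - len(connected) / max(len_f, 1),
            "top_fertilities": top_fertilities,
            "longest_connected_span": longest_connected,
            "longest_unconnected_substring": longest_unconnected}
```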
<Paragraph position="4"> Using this alignment strategy, we follow Och and Ney (2003) and compute one alignment for each translation direction (f→e and e→f) and then combine them. Och and Ney present three combination methods: intersection, union, and refined (a form of intersection expanded with certain additional neighboring links).</Paragraph>

<Paragraph position="5"> Thus, for each sentence pair, we compute five alignments (two modified-IBM-Model-1 alignments plus three combinations) and then extract one set of general features and five sets of alignment features (as described in the previous section).</Paragraph> </Section>

<Section position="6" start_page="484" end_page="485" type="sub_section"> <SectionTitle> 3.3 Training and Testing </SectionTitle>

<Paragraph position="0"> We create training instances for our classifier from a small parallel corpus. The simplest way to obtain classifier training data from a parallel corpus is to generate all possible sentence pairs from the corpus (the Cartesian product). For a corpus of \(n\) sentence pairs, this generates \(n^2\) training instances, out of which only \(n\) are positive (i.e., belong to class "parallel") and the rest are negative.</Paragraph>

<Paragraph position="1"> One drawback of this approach is that the resulting training set is very imbalanced, i.e., it has many more negative examples than positive ones. Classifiers trained on such data do not achieve good performance; they generally tend to predict the majority class, i.e., classify most sentences as non-parallel (which has indeed been the case in our experiments). Our solution is to downsample, i.e., to eliminate a number of randomly selected negative instances.</Paragraph>

<Paragraph position="2"> Another problem is that the large majority of sentence pairs in the Cartesian product have low word overlap (i.e., few words that are translations of each other). As explained in Section 2 (and shown in Figure 2), when extracting data from a comparable corpus, we apply the classifier only to the output of the word-overlap filter. Thus, low-overlap sentence pairs, which would be discarded by the filter, are unlikely to be useful as training examples. We therefore use for training only those pairs from the Cartesian product that are accepted by the word-overlap filter. This has the additional advantage that, since all these pairs have many words in common, the classifier learns to make distinctions that cannot be made on the basis of word overlap alone.</Paragraph>

<Paragraph position="3"> To summarize, we prepare our classifier training set in the following manner: starting from a parallel corpus of about 5,000 sentence pairs, we generate all the sentence pairs in the Cartesian product; we discard the pairs that do not fulfill the conditions of the word-overlap filter; and, if the resulting set is imbalanced, i.e., the ratio of non-parallel to parallel pairs is greater than five, we balance it by removing randomly chosen non-parallel pairs. We then compute word alignments and extract feature values.</Paragraph>
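This training-set preparation fits in a few lines. A sketch, reusing the `passes_filter` function from Section 2.2 above; corpus entries are (source_sentence, target_sentence) word-list pairs:

```python
import random

def make_training_set(parallel_corpus, dictionary, max_neg_ratio=5, seed=0):
    """Cartesian product -> word-overlap filter -> downsample negatives
    so that the non-parallel:parallel ratio is at most 5 (Section 3.3)."""
    srcs = [s for s, _ in parallel_corpus]
    tgts = [t for _, t in parallel_corpus]
    positives, negatives = [], []
    for i, s in enumerate(srcs):
        for j, t in enumerate(tgts):
            if passes_filter(s, t, dictionary):
                (positives if i == j else negatives).append((s, t))
    limit = max_neg_ratio * len(positives)
    if len(negatives) > limit:
        negatives = random.Random(seed).sample(negatives, limit)
    return positives, negatives
```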
Since we are dealing with few parameters and have sufficiently many training instances, using more advanced training algorithms is unlikely to bring significant improvements. We test the performance of the classifier by generating test instances from a different parallel corpus (also around 5,000 sentence pairs) and checking how many of these instances are correctly classified. We prepare the test set by creating the Cartesian product of the sentences in the test parallel corpus and applying the word-overlap filter (we do not perform any balancing). Although we apply the filter, we still conceptually classify all pairs from the Cartesian product in a two-stage classification process: all pairs discarded by the filter are classified as &quot;non-parallel,&quot; and for the rest, we obtain predictions from the classifier. Since this is how we apply the system on truly unseen data, this is the process in whose performance we are interested.</Paragraph> <Paragraph position="5"> We measure the performance of the classification process by computing precision and recall. Precision is the ratio of sentence pairs correctly judged as parallel to the total number of pairs judged as parallel by the classifier. Recall is the ratio of sentence pairs correctly identified as parallel by the classifier to the total number of truly parallel pairs--i.e., the number of pairs in the parallel corpus used to generate the test instances. Both numbers are expressed as percentages. More formally: let classified parallel be the total number of sentence pairs from our test set that the classifier judged as parallel, classified well be the number of pairs that the classifier correctly judged as parallel, and true parallel be the total number of parallel pairs in the test set. Then:</Paragraph> </Section> <Section position="7" start_page="485" end_page="488" type="sub_section"> <SectionTitle> 3.4 Performance Evaluation </SectionTitle> <Paragraph position="0"> There are two factors that influence a classifier's performance: dictionary coverage and similarity between the domains of the training and test instances. We performed evaluation experiments to account for both these factors.</Paragraph> <Paragraph position="1"> All our dictionaries are automatically learned from parallel data; thus, we can create dictionaries of various coverage by learning them from parallel corpora of different sizes. We use five dictionaries, learned from five initial out-of-domain parallel corpora, whose sizes are 100k, 1M, 10M, 50M, and 95M tokens, as measured on the English side.</Paragraph> <Paragraph position="2"> Since we want to use the classifier to extract sentence pairs from our in-domain comparable corpus, we test it on instances generated from an in-domain parallel corpus. In order to measure the effect of the domain difference, we use two training sets: one generated from an in-domain parallel corpus and another one from an out-of-domain parallel corpus.</Paragraph> <Paragraph position="3"> In summary, for each language pair, we use the following corpora: From each initial, out-of-domain corpus, we learn a dictionary. We then take the classifier training and test corpora and, using the method described in the previous section, create two sets of training instances and one set of test instances. We train two classifiers (one on each training set) and evaluate both of them on the test set. 
(Figure 6: Precision and recall of the Chinese-English classifiers.)

<Paragraph position="4"> Figures 5 and 6 show the recall and precision of our classifiers for both Arabic-English and Chinese-English. The results show that the precision of our classification process is robust with respect to dictionary coverage and training domain. Even when starting from a very small initial parallel corpus, we can build a high-precision classifier. Having a good dictionary and training data from the right domain does help, though, mainly with respect to recall.</Paragraph>

<Paragraph position="5"> The classifiers achieve high precision because their positive training examples are clean parallel sentence pairs with high word overlap (since the pairs with low overlap are filtered out); thus, the classification decision frontier is pushed toward "good-looking" alignments. The low recall results are partly due to the word-overlap filter (the first stage of the classification process), which discards many parallel pairs. If we don't apply the filter before the classifier, the recall results increase by about 20% (with no loss in precision). However, the filter plays a very important role in keeping the extraction pipeline robust and efficient (as shown in Figure 7, the filter discards 99% of the candidate pairs), so this loss of recall is a price worth paying.</Paragraph>

<Paragraph position="6"> Classifier evaluations using different subsets of features show that most of the classifier performance comes from the general features together with the alignment features concerning the percentage and number of words that have no connection. However, we expect that in real data the differences between parallel and non-parallel pairs are less clear than in our test data (see the discussion in Section 7) and can no longer be accounted for only by counting the linked words; thus, the other features should become more important.</Paragraph>

(Figure 7: The amounts of data processed by our system during extraction from the Chinese-English comparable corpus.)

</Section> </Section>

<Section position="4" start_page="488" end_page="491" type="metho"> <SectionTitle> 4. Data Extraction Experiments </SectionTitle> <Paragraph position="0"/>

<Section position="1" start_page="488" end_page="489" type="sub_section"> <SectionTitle> 4.1 Controlled Experiments </SectionTitle>

<Paragraph position="0"> The comparable corpora that we use for parallel sentence extraction are collections of news stories published by the Agence France Presse and Xinhua News agencies. They are parts of the Arabic, English, and Chinese Gigaword corpora, which are available from the Linguistic Data Consortium. From these collections, for each language pair, we create an in-domain comparable corpus by putting together articles coming from the same agency and the same time period. Table 1 presents in detail the sources and sizes of the resulting comparable corpora.
The remainder of this section presents the various data sets that we extracted automatically from these corpora under various experimental conditions.</Paragraph>

<Paragraph position="1"> In the experiments described in Section 3.4, we started out with five out-of-domain initial parallel corpora of various sizes and obtained five dictionaries and five out-of-domain-trained classifiers (per language pair). We now plug each of these classifiers (and its associated dictionary) into our extraction system (Section 2) and apply it to our comparable corpora. We thus obtain five Arabic-English and five Chinese-English extracted corpora.</Paragraph>

<Paragraph position="2"> Note that in each of these experiments the only resource used by our system is the initial, out-of-domain parallel corpus. Thus, the experiments fit in the framework of interest described in Section 1, which assumes the availability of (limited amounts of) out-of-domain training data and (large amounts of) in-domain comparable data.</Paragraph>

<Paragraph position="3"> Table 2 shows the sizes of the extracted corpora for each initial corpus size, for both Chinese-English and Arabic-English. As can be seen, when the initial parallel corpus is very small, the amount of extracted data is also quite small. This is due to the low coverage of the dictionary learned from that corpus. Our candidate pair selection step (Section 2.2) discards pairs with too many unknown (or unrelated) words, according to the dictionary; thus, only a few sentences fulfill the word-overlap condition of our filter.</Paragraph>

<Paragraph position="4"> As mentioned in Section 1, our goal is to use the extracted data as additional MT training data and obtain better translation performance on a given in-domain MT test set. A simple way of estimating the usefulness of the data for this purpose is to measure its coverage of the test set, i.e., the percentage of running n-grams from the test corpus that also appear in our corpus. Tables 3 and 4 present the coverage of our extracted corpora. For each initial corpus size, the first column shows the coverage of that initial corpus, and the second column shows the coverage of the initial corpus plus the extracted corpus. Each cell contains four numbers that represent the coverage with respect to unigrams, bigrams, trigrams, and 4-grams. The numbers show that unigram coverage depends only on the size of the corpus (and not on the domain), but for longer n-grams, our in-domain extracted data brings significant improvements in coverage.</Paragraph> </Section>
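The coverage measure used in Tables 3 and 4 is straightforward to compute. A sketch over tokenized sentences:

```python
def ngram_coverage(train_sents, test_sents, max_n=4):
    """Percentage of running n-grams (n = 1..max_n) of the test corpus
    that also occur somewhere in the training corpus."""
    seen = [set() for _ in range(max_n)]
    for sent in train_sents:
        for n in range(1, max_n + 1):
            for k in range(len(sent) - n + 1):
                seen[n - 1].add(tuple(sent[k:k + n]))
    results = []
    for n in range(1, max_n + 1):
        total = covered = 0
        for sent in test_sents:
            for k in range(len(sent) - n + 1):
                total += 1
                covered += tuple(sent[k:k + n]) in seen[n - 1]
        results.append(100.0 * covered / max(total, 1))
    return results  # [unigram %, bigram %, trigram %, 4-gram %]
```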
<Section position="2" start_page="489" end_page="491" type="sub_section"> <SectionTitle> 4.2 Non-Controlled Experiments Using Web-Based Non-Parallel Corpora </SectionTitle>

<Paragraph position="0"> The extraction experiments from the previous section are controlled experiments in which we use only limited amounts of parallel data for our extraction system. In this section, we describe experiments whose goal is to assess the applicability of our method to data that we mined from the Web.</Paragraph>

<Paragraph position="1"> We obtained comparable corpora from the Web by going to bilingual news web sites (such as Al-Jazeera) and downloading news articles in each language independently. In order to get as many articles as possible, we used each web site's search engine to obtain lists of articles and their URLs, and then crawled those lists. We used the Agent-Builder tool (Ticrea and Minton 2003; Minton, Ticrea, and Beach 2003) for crawling. The tool can be programmed to automatically initiate searches with different parameters and to identify and extract the desired article URLs (as well as other information, such as dates and titles) from the result pages. Table 5 shows the sources, time periods, and sizes of the data sets that we downloaded.</Paragraph>

<Paragraph position="2"> For the extraction experiments, we used dictionaries of high coverage, learned from all our available parallel training data (whose sizes are measured in number of English tokens). We applied our extraction method to both the LDC-released Gigaword corpora and the Web-downloaded comparable corpora. For each language pair, we used the highest-precision classifier from those presented in Section 3.4. In order to obtain data of higher quality, we did not use all the sentences classified as parallel, but only those for which the probability computed by our classifier was higher than 0.70. Table 6 shows the amounts of extracted data, measured in number of English tokens. For Arabic-English, we were able to extract much more data from the Gigaword corpora than in our previous experiments (see Table 2), clearly due to the better dictionary. For Chinese-English, there was no increase in the size of the extracted data (although the amount in Table 6 is smaller than that in Table 2, it counts only sentence pairs extracted with confidence higher than 0.70).</Paragraph>

<Paragraph position="3"> In the previous section, we measured, for our training corpora, their coverage of the test set (Tables 3 and 4). We repeated the measurements for the training data from Table 6 and obtained very similar results: using the additional extracted data improves coverage, especially for longer n-grams.</Paragraph>

<Paragraph position="4"> To give the reader an idea of the amount of data that is funneled through our system, we show in Figure 7 the sizes of the data processed by each of the system's components during extraction from the Gigaword and Web-based Chinese-English comparable corpora. We use a dictionary learned from a parallel corpus of 190M English tokens and a classifier trained on instances generated from a parallel corpus of 220k English tokens. We start with a comparable corpus consisting of 500k Chinese articles and 600k English articles. The article selection step (Section 2.1) outputs 7.5M similar article pairs; from each article pair we generate all possible sentence pairs and obtain 2,400M pairs. Of these, less than 1% (17M) pass the candidate selection stage (Section 2.2) and are presented to the ME classifier. The system outputs 430k sentence pairs (9.5M English tokens) that have been classified as parallel (with probability greater than 0.7).</Paragraph>

<Paragraph position="5"> The figure also presents, in its lower part, the parameters that control the filtering at each stage; they are summarized in the configuration sketch after the list below.</Paragraph>

<Paragraph position="6">
• best K results: in the article selection stage (Section 2.1), for each foreign article we consider only the top K most similar English ones. In our experiments, K is set to 20.
• date window: when looking for possible article pairs, we consider only English articles whose publication dates fall within a window of 5 days around the publication date of the foreign one.
• word overlap: the word-overlap filter (Section 2.2) discards sentence pairs that have less than a certain proportion of words in common (according to the bilingual dictionary). The value we use (expressed as a percentage of sentence length) is 50.
• length ratio: similarly, the word-overlap filter discards pairs whose length ratio is greater than this value, which we set to 2.
• decision threshold: the ME classifier associates a probability with each of its predictions. Values above 0.5 indicate that the classifier considers the particular sentence pair to be parallel; the higher the value, the higher the classifier's confidence. Thus, in order to obtain higher precision, we can choose to label as parallel only those pairs for which the classifier probability is above a certain threshold. In the experiments of Section 4.1, we use the (default) threshold of 0.5, while in Section 4.2 we use 0.7.</Paragraph> </Section> </Section>
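For reference, the filtering parameters listed above can be collected in one place; this is just a summary of the values reported in this article, not the system's actual configuration format:

```python
PIPELINE_PARAMS = {
    "best_k_results": 20,       # top-K similar English articles per foreign article
    "date_window_days": 5,      # +/- days around the foreign publication date
    "word_overlap_pct": 50,     # min words in common, as % of sentence length
    "max_length_ratio": 2,      # max sentence-length ratio
    "decision_threshold": 0.5,  # 0.5 in Section 4.1; 0.7 in Section 4.2
}
```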
<Section position="5" start_page="491" end_page="493" type="metho"> <SectionTitle> 5. Machine Translation Improvements </SectionTitle>

<Paragraph position="0"> Our main goal is to extract, from an in-domain comparable corpus, parallel training data that improves the performance of an out-of-domain-trained SMT system. Thus, we evaluate our extracted corpora by showing that adding them to the out-of-domain training data of a baseline MT system improves its performance.</Paragraph>

<Section position="1" start_page="492" end_page="492" type="sub_section"> <SectionTitle> 5.1 Controlled Experiments </SectionTitle>

<Paragraph position="0"> We first evaluate the extracted corpora presented in Section 4.1. The extraction system used to obtain each of those corpora made use of a certain initial out-of-domain parallel corpus. We train a Baseline MT system on that initial corpus. We then train another MT system (which we call PlusExtracted) on the initial corpus plus the extracted corpus. In order to compare the quality of our extracted data with that of human-translated data from the same domain, we also train an UpperBound MT system on the initial corpus plus a corpus of in-domain, human-translated data. For each initial corpus, we use the same amount of human-translated data as there is extracted data (see Table 2). Thus, for each language pair and each initial parallel corpus, we compare three MT systems: Baseline, PlusExtracted, and UpperBound.</Paragraph>

<Paragraph position="1"> All our MT systems were trained using a variant of the alignment template model described in Och (2003). Each system used two language models: a very large one, trained on 800 million English tokens, which is the same for all the systems; and a smaller one, trained only on the English side of the parallel training data for that particular system. This ensured that any differences in performance are caused only by differences in the training data.</Paragraph>

<Paragraph position="2"> The systems were tested on the news test corpus used for the NIST 2003 MT evaluation. Translation performance was measured using the automatic BLEU evaluation metric (Papineni et al. 2002) on four reference translations.</Paragraph>

<Paragraph position="3"> Figures 8 and 9 show the BLEU scores obtained by our MT systems. The 95% confidence intervals of the scores, computed by bootstrap resampling (Koehn 2004), are marked on the graphs; the delta value is around 1.2 for Arabic-English and 1 for Chinese-English.</Paragraph>

(Figure: MT performance improvements for Chinese-English.)

<Paragraph position="4"> As the results show, the automatically extracted additional training data yields significant improvements in performance over most initial training corpora for both language pairs. At least for Chinese-English, the improvements are quite comparable to those produced by the human-translated data. And, as can be expected, the impact of the extracted data decreases as the size of the initial corpus increases.</Paragraph>

<Paragraph position="5"> In order to check that the classifier really does something important, we performed a few experiments without it. After the article selection step, we simply paired each foreign document with the best-matching English one, assumed they were parallel, sentence-aligned them with a generic sentence alignment method, and added the resulting data to the training corpus. The resulting BLEU scores were practically the same as the baseline; thus, our classifier does indeed help to discover higher-quality parallel data.</Paragraph> </Section>

<Section position="2" start_page="492" end_page="493" type="sub_section"> <SectionTitle> 5.2 Non-Controlled Experiments </SectionTitle>

<Paragraph position="0"> We also measured the MT performance impact of the extracted corpora described in Section 4.2. We trained a Baseline MT system on all our available (in-domain and out-of-domain) parallel data, and a PlusExtracted system on the parallel data plus the extracted in-domain data. Clearly, we have no UpperBound system in this case.</Paragraph>

<Paragraph position="1"> The results are presented in the first two rows of Table 7. Adding the extracted corpus lowers the score for the Arabic-English system and improves the score for the Chinese-English one; however, none of the differences are statistically significant. Since the baseline systems are trained on such large amounts of data (see Section 4.2), it is not surprising that our extracted corpora have no significant impact. In an attempt to give a better indication of the value of these corpora, we used them alone as MT training data. The BLEU scores obtained by the systems trained on them are presented in the third row of Table 7. For comparison purposes, the last row of the table shows the scores of systems trained on 10M English tokens of out-of-domain data. As can be seen, our automatically extracted corpora yield better MT performance than out-of-domain parallel corpora of similar size. Admittedly, this is not a fair comparison, since the extracted corpora were obtained using all our available parallel data. The numbers do show, however, that the extracted data, although obtained automatically, is of good value for machine translation.</Paragraph> </Section> </Section>
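The confidence intervals reported above come from bootstrap resampling over test sentences. A sketch in the spirit of Koehn (2004); `corpus_bleu` is an assumed metric function mapping parallel lists of hypotheses and references to a corpus-level score:

```python
import random

def bootstrap_ci(hypotheses, references, corpus_bleu,
                 n_resamples=1000, seed=0):
    """95% confidence interval for a corpus-level score, obtained by
    resampling the test sentences with replacement and rescoring."""
    rng = random.Random(seed)
    n = len(hypotheses)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        scores.append(corpus_bleu([hypotheses[i] for i in idx],
                                  [references[i] for i in idx]))
    scores.sort()
    return scores[int(0.025 * n_resamples)], scores[int(0.975 * n_resamples)]
```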
<Section position="6" start_page="493" end_page="499" type="metho"> <SectionTitle> 6. Bootstrapping </SectionTitle>

<Paragraph position="0"> As can be seen from Table 2, the amount of data we can extract from our comparable corpora is adversely affected by poor dictionary coverage. Thus, if we start with very little parallel data, we do not make good use of the comparable corpora. One simple way to alleviate this problem is to bootstrap: after we have extracted some in-domain data, we can use it to learn a new dictionary and go back and extract again (this loop is sketched at the end of this section). Bootstrapping was also successfully applied to this problem by Fung and Cheung (2004).</Paragraph>

<Paragraph position="1"> We performed bootstrapping iterations starting from two very small corpora: 100k English tokens and 1M English tokens, respectively. After each iteration, we trained (and evaluated) an MT system on the initial data plus the data extracted in that iteration. We did not use any of the data extracted in previous iterations, since it is mostly a subset of that extracted in the current iteration. We iterated until there were no further improvements in MT performance on our development data.</Paragraph>

<Paragraph position="2"> Figures 10 and 11 show the sizes of the data extracted at each iteration, for both initial corpus sizes. Iteration 0 is the one that uses the dictionary learned from the initial corpus. Starting with 100k words of parallel data, we eventually collect 20M words of in-domain Arabic-English data and 90M words of in-domain Chinese-English data.</Paragraph>

(Figure 10: Sizes of the Arabic-English corpora extracted using bootstrapping, in millions of English tokens.)

<Paragraph position="3"> Figures 12 and 13 show the BLEU scores of these MT systems. For comparison purposes, we also plotted on each graph the performance of our best MT system for that language pair, trained on all our available parallel data (Table 7).</Paragraph>

(Figure 12: BLEU scores of the Arabic-English MT systems using bootstrapping.)
(Figure 13: BLEU scores of the Chinese-English MT systems using bootstrapping.)

<Paragraph position="4"> As we can see, bootstrapping allows us to extract significantly larger amounts of data, which leads to significantly higher BLEU scores. Starting with as little as 100k English tokens of parallel data, we obtain MT systems that come within 7-10 BLEU points of systems trained on parallel corpora of more than 100M English tokens. This shows that, using our method, a good-quality MT system can be built from very little parallel data and a large amount of comparable, non-parallel data.</Paragraph>
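The bootstrapping loop referenced above can be sketched as follows. The three callables are stand-ins for the dictionary learner, the extraction pipeline of Section 2, and MT training plus evaluation on the development data:

```python
def bootstrap_extraction(initial_parallel, comparable_corpus, dev_set,
                         learn_dictionary, extract, train_and_score):
    """Iterate: learn a dictionary, extract in-domain data, retrain, and
    stop when MT performance on the dev data no longer improves (Section 6).
    Only the current iteration's extraction is kept, since earlier
    extractions are mostly subsets of it."""
    best_score, best_extracted = float("-inf"), []
    dictionary_data = initial_parallel
    while True:
        dictionary = learn_dictionary(dictionary_data)      # iteration 0 uses
        extracted = extract(comparable_corpus, dictionary)  # the initial data
        score = train_and_score(initial_parallel + extracted, dev_set)
        if score <= best_score:
            return best_extracted, best_score
        best_score, best_extracted = score, extracted
        dictionary_data = initial_parallel + extracted
```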
7. Examples

<Paragraph position="5"> We conclude the description of our method by presenting a few sentence pairs extracted by our system. We chose the examples by looking for cases in which a given foreign sentence was judged parallel to several different English sentences. Figures 14 and 15 show the foreign sentence in Arabic and Chinese, respectively, followed by a human-produced translation in bold italic font, followed by the automatically extracted matching English sentences in normal font. The sentences are picked from the data sets presented in Section 4.2.</Paragraph>

<Paragraph position="6"> The examples reveal the two main types of errors that our system makes. The first type concerns cases in which the system classifies as parallel sentence pairs that, although they share many content words, express slightly different meanings (as in Figure 14, example 1). The second type concerns cases in which one sentence is a translation of the other, plus additional (often quite long) phrases (Figure 15, examples 1 and 5).</Paragraph>

<Paragraph position="7"> These errors are caused by the noise present in the automatically learned dictionaries and by the use of a weak word alignment model for extracting the classifier features. In an automatically learned dictionary, many words (especially the frequent, non-content ones) will have a lot of spurious translations. The IBM Model 1 alignment model takes no account of word order and allows a source word to be connected to arbitrarily many target words. Alignments computed using this model and a noisy, automatically learned dictionary will contain many incorrect links. Thus, if two sentences share several content words, these incorrect links, together with the correct links between the common content words, will yield an alignment good enough to make the classifier judge the sentence pair as parallel.</Paragraph>

<Paragraph position="8"> The effect of the noise in the dictionary is even more clear for sentence pairs with few words, such as Figure 14, example 6. The sentences in that example are tables of soccer team statistics. They are judged parallel because corresponding digits align to each other and because, according to our dictionary, the Arabic word for "Mexico" can be translated as any of the country names listed in the example.</Paragraph>

<Paragraph position="9"> These examples also show that the problem of finding only true translation pairs is hard. Two sentences may share many content words and yet express different meanings (see Figure 14, example 1). However, our task of obtaining useful MT training data does not require a perfect solution; as we have seen, even such noisy training pairs can help improve a translation system's performance.</Paragraph> </Section> </Paper>