<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1409"> <Title>Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Data Collection and Preparation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Obtaining Tamil Data </SectionTitle> <Paragraph position="0"> Tamil data is not very difficult to find on the web. There are several Tamil newspapers and magazines with online editions, and the large international Tamil community fosters the use of the Internet for the dissemination of information. After initial investigation of several web sites, we decided to download our experimental corpus from www.tamilnet.com, a news site that provides local news on Sri Lanka in both Tamil and English. The Tamil and English news texts on this site do not appear to be translations of each other. The availability of a fairly large in-domain corpus of local news on Sri Lanka in English (over 2 million words) allowed us to train an in-domain English language model of Sri Lankan news.</Paragraph> </Section> <Section position="2" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.2 Encoding and Tokenization </SectionTitle> <Paragraph position="0"> Tamil is written in a phonematic, non-Latin script. Several encoding schemes exist in parallel. Even though the Unicode standard includes a set of glyphs for Tamil, it is not widely used in practice. Most web sites that offer Tamil-language material assume Latin-1 encoding and rely on special TrueType fonts, which are often also offered for free download at those sites. Tamil text is therefore fairly easy to identify on web sites via the face attribute of the HTML font tag. All that is necessary is a list of the Tamil font names used by the different sites, and knowledge of which encodings these fonts implement. While we could restrict ourselves to one data source and encoding for our experiment, any large-scale system would have to take this into account. In order to make the source text recognizable to humans who have no knowledge of Tamil, we decided to work with transliterated text.</Paragraph> </Section>
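The font-based identification described in Section 2.2 can be sketched as follows; this is a minimal Python sketch, and the font names and encoding labels in the mapping are hypothetical placeholders (the actual lists would have to be compiled by inspecting the target sites):

```python
import re

# Hypothetical mapping from Tamil font names (as they appear in the HTML
# "face" attribute) to the encoding scheme each font implements. The real
# list would be compiled by hand from the sites being crawled.
TAMIL_FONT_ENCODINGS = {
    "bamini": "bamini",       # assumed font name / encoding label
    "tscii-font": "tscii",    # assumed font name / encoding label
}

FONT_TAG = re.compile(
    r"<font[^>]*\bface\s*=\s*['\"]([^'\"]+)['\"][^>]*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)

def extract_tamil_spans(html: str):
    """Yield (encoding, text) pairs for spans set in a known Tamil font."""
    for match in FONT_TAG.finditer(html):
        face = match.group(1).strip().lower()
        if face in TAMIL_FONT_ENCODINGS:
            yield TAMIL_FONT_ENCODINGS[face], match.group(2)
```

Once a span is attributed to a font, the corresponding encoding determines which byte-to-transliteration table to apply; each supported font needs its own table.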
<Section position="3" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.3 Translating the Corpus </SectionTitle> <Paragraph position="0"> Originally we hoped to be able to create a parallel corpus of about 100,000 words on the Tamil side within one month, using several translators. Professional translation services in the US currently charge rates of about 30 cents per English word for translations from Tamil into English (translations, however, were produced from the original Tamil). Given that the English translation of a Tamil text usually contains about 1.2 times as many words as the Tamil original, the translation of a corpus of 100,000 Tamil words would cost approximately USD 36,000. This was far beyond our budget. In India, by comparison, raw translations may cost as little as one cent per Tamil word (personal communication with Thomas Malten, University of Cologne). However, outsourcing the translation work abroad was not feasible for us, since we had neither the administrative infrastructure nor the time to manage such an effort. Also, working with partners so remote would have made it very difficult to communicate our exact needs and to implement proper quality control.</Paragraph> <Paragraph position="1"> We finally decided to hire as translators four entering and second-year graduate students in the department of engineering whose native language is Tamil and who had responded to an ad posted to the local mailing list for students from India.</Paragraph> <Paragraph position="2"> In order to manage the corpus translation process, we set up a web interface through which translators could retrieve source texts and upload their translations, post-editors could post-edit text online, project progress could be monitored, and all incoming text was available to other project members as soon as it was submitted.</Paragraph> <Paragraph position="3"> We originally assumed that translators would be able to translate about 500 words per hour if we were content with raw translations and hardly any formatting, and if we allowed them to skip difficult words or sentences. This estimate was based on an internal evaluation in which multilingual members of our group translated sample documents from their native languages (Arabic, German, Romanian) into English and kept track of the time they spent doing so.</Paragraph> <Paragraph position="4"> It turned out that our expectations were very much exaggerated with respect to both translation speed and translation quality. The actual translation speed for Tamil varied between 156 and 247 words per hour, with an average of 170 words per hour. In 139 hours of reported translation time (over a period of eventually six weeks), about 24,000 words / 1,300 sentences of Tamil text were translated, at an effective cost of ca. 10.8 cents per Tamil word (translators' compensation plus administrative overhead). This figure does not include the effort of having the translations manually post-edited by a native speaker of English (12-16 person hours).</Paragraph> <Paragraph position="5"> The overall organization of the project (source data retrieval, hiring and management of the translators, design and implementation of the web interface for managing the project via the Internet, development of the transliterator and stemmer, etc.) required an estimated additional 2.5 person months. However, a good part of this effort led to resources that can also be used for other purposes.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.4 Lessons Learned for Future Projects </SectionTitle> <Paragraph position="0"> If we were to give advice for future, similar projects, we would emphasize and recommend the following:
2.4.1 Good translators are not easy to find
It is difficult to find good translators for a short-term commitment. Unless one is willing to pay a premium price, it is unlikely that one will find professional translators willing to commit much of their time for a limited period and on short notice.
2.4.2 Make the translation job attractive
As foreign students, our translators would each have been allowed to work up to twenty hours per week. None of them did, because the work was frustrating and boring, and because they found more attractive, long-term employment on campus.
Our translators' frustration may have been fostered by several factors:
- the differences between Sri Lankan Tamil (the variety used in our corpus) and the Tamil spoken in Southern India (the native language of our translators), which, according to our translators, made translating very difficult;
- our translators' lack of translation experience; and
- our high expectations.
We originally told our translators that, since they were not working on site, we would expect 500 words of translation per reported hour of work. When we later switched to hourly pay regardless of translation volume, the translation volume picked up slightly.</Paragraph> <Paragraph position="1">
2.4.3 Be prepared to post-edit
In professional translating, translators typically translate into their native language only. For low-density or "small" languages, one may not be able to find translators whose native language is English, so it may be necessary to have the translations post-edited by people with greater language proficiency in English.</Paragraph> <Paragraph position="2">
2.4.4 Have translators and post-editors work on site
It is better to have translators and post-editors work on site, and ideally as teams, so that they can resolve ambiguities and misunderstandings immediately, without the delays of communicating indirectly, be it by email or other means. A post-editor who does not know the source language may misinterpret the translator, as the following case from our corpus illustrates:
Raw translation: Information about the schools in which people who migrated to Kudaanadu are staying is being gathered.
Post-edited version: Information about the schools in (sic!) which immigrants to Kudaanadu are attending is being gathered.</Paragraph> <Paragraph position="3"> In this case, the post-editor clearly misinterpreted the translator. What the translator meant to and actually did say is that information was being gathered about the schools in which migrants/war refugees who had arrived in Kudaanadu had found shelter. The post-editor, however, interpreted the phrase people who migrated to Kudaanadu as describing immigrants and assumed that information was being gathered about their education rather than their housing.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Evaluation Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 A Priori Considerations </SectionTitle> <Paragraph position="0"> The richer the morphology of a language, the greater the total number of distinct word forms a given corpus consists of, and the smaller the probability that any particular word form actually occurs in a given text segment. Figure 1 shows the percentage of word forms in unseen text that have occurred in previously seen text, as a function of the amount of previously seen text. The graph on the left shows the curves for English, the one on the right the curves for Sri Lankan Tamil. The graphs show averages over 100 runs on different text fragments; the error bars indicate standard deviation.</Paragraph> <Paragraph position="1"> The numbers were computed in the following manner: a corpus of 120,000 tokens was split into segments of 1,000 tokens each. For each segment s_i, we computed how many of its tokens had previously been seen in the segments s_1, ..., s_(i-1). The upper curve shows the percentage of tokens that had been seen at least once before; the lower curve shows the percentage of tokens that had been seen at least five times before.</Paragraph>
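A minimal Python sketch of this computation, assuming whitespace-tokenized text (the paper does not specify the tokenizer), might look as follows; it produces one coverage curve per threshold value:

```python
from collections import Counter

def coverage_curve(tokens, seg_size=1000, min_count=1):
    """For each 1,000-token segment, return the fraction of its tokens
    whose word form occurred at least `min_count` times in all preceding
    segments (cf. the curves in Figure 1)."""
    seen = Counter()
    fractions = []
    for start in range(0, len(tokens) - seg_size + 1, seg_size):
        segment = tokens[start:start + seg_size]
        if start > 0:  # the first segment has no previously seen text
            known = sum(1 for t in segment if seen[t] >= min_count)
            fractions.append(known / len(segment))
        seen.update(segment)
    return fractions

# coverage_curve(text.split(), min_count=1) gives the upper curve,
# coverage_curve(text.split(), min_count=5) the lower one, for a single run.
```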
<Paragraph position="2"> For the purpose of statistical NLP, it seems reasonable to assume that the lower curve gives a better indication of what percentage of previously unseen text we can expect to be "known" to a statistical model trained on a corpus of n tokens.</Paragraph> <Paragraph position="3"> At a corpus size of 24,000 tokens, which is approximately the size of the parallel corpus we were able to create during our experiment, about 28% of all word forms in previously unseen Sri Lankan Tamil text cannot be found in the corpus, and 50% have been seen fewer than five times. In other words, if we train a system on this data, we can expect it to stumble over every other word! At a corpus size of 100,000 tokens, the numbers are 17% and 33%. For English, the numbers are 9%/23% for a corpus of 24K tokens and 0%/8% for a corpus of 100K tokens.</Paragraph> <Paragraph position="4"> In order to boost the text coverage, we built a simple stemmer for Tamil, based on the Tamil inflection tables in Steever (1990) and some additional inspection of our parallel corpus. The stemmer uses regular-expression matching to cut off inflectional endings and introduces some extra tokens for negation and certain case markings (such as locative and genitive), which are all marked morphologically in Tamil. It should be noted that the stemmer is far from perfect and was only intended as an interim solution. The performance increases are displayed in Figure 2 (solid lines: text coverage for unstemmed Tamil data, seen at least once and at least five times, respectively; dashed lines: text coverage for stemmed data). For a corpus size of 24K tokens, the percentages of unknown items drop to 19% (from 28%; never seen before) and 36% (from 50%; seen fewer than five times). For a training corpus of 100K tokens, the numbers are 12% and 23% (from 17%/33%).</Paragraph> </Section>
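A stemmer of the kind just described might be sketched as follows in Python; the endings and marker tokens shown are illustrative assumptions, not the actual rules, which the paper does not list:

```python
import re

# Illustrative (not actual) transliterated inflectional endings; a real
# rule set would come from inflection tables such as Steever (1990).
ENDINGS = [
    (r"(?:illai)$", "+NEG"),      # assumed negation ending
    (r"(?:il|ilE)$", "+LOC"),     # assumed locative endings
    (r"(?:in|uTaiya)$", "+GEN"),  # assumed genitive endings
]

def stem(token: str) -> str:
    """Cut off one inflectional ending and emit an extra marker token,
    mirroring the strategy described above."""
    for pattern, marker in ENDINGS:
        if re.search(pattern, token):
            return re.sub(pattern, "", token) + " " + marker
    return token

def stem_text(text: str) -> str:
    return " ".join(stem(t) for t in text.split())
```

Note that naive suffix matching of this kind will over-stem words whose stems merely happen to end in one of the listed strings, which is consistent with the caveat above that the stemmer is far from perfect.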
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Task-Based Pilot Evaluation </SectionTitle> <Paragraph position="0"> Given these numbers, it is obvious that one cannot expect much performance from a system that relies on models trained on only 24K tokens of data. As a matter of fact, it is close to impossible to make any sense whatsoever of the output of such a system (cf. Fig. 3).</Paragraph> <Paragraph position="1"> To get an estimate of the performance with more training data, we augmented our corpus with a parallel corpus of international news texts in Southern Indian Tamil, which was made available to us by Fred Gey of the University of California at Berkeley (henceforth: Berkeley corpus). This corpus contains ca. 3,800 sentence pairs with 75,800 Tamil tokens after stemming (before stemming: 60,000; the difference is due to the introduction of additional markers during stemming). Some of the parallel data was withheld for system evaluation; the augmented training corpus (Berkeley and TamilNet corpora; short: B+TN) had a size of 85K tokens on the Tamil side. For Sri Lankan Tamil, the augmented training corpus had a text coverage of 81% (seen at least once; 75% without augmentation) and 67% (seen at least five times; 60% without augmentation), respectively.</Paragraph> <Paragraph position="2"> We trained IBM Translation Model 4 (Brown et al., 1993) both on our corpus alone and on the augmented corpus, using the EGYPT toolkit (Knight et al., 1999; Al-Onaizan et al., 1999), and then translated a number of texts using different translation models and different transfer methods, namely glossing (replacing each Tamil word by the most likely candidate from the translation tables created with the EGYPT toolkit) and Model 4 decoding (Brown et al., 1995; Germann et al., 2001). Figure 3 shows the output of the different systems in comparison with the human translation.</Paragraph> <Paragraph position="3"> We then conducted the following experiments. Seven human subjects without any knowledge of Tamil were given translations of a set of 15 texts (all from the Berkeley corpus) and asked to categorize them according to a topic hierarchy of 4 major and 11 minor categories. Except for one duplicate set, each subject received a different set of translations. The sets differed in the training parameters and the translation method used. Table 1 shows the results of this evaluation. [Table 1 notes: a: pegging causes the training algorithm to consider a larger search space; b: correct top-level category but incorrect sub-category; c: translation by maximizing the IBM Model 4 probability of the source/translation pair (Brown et al., 1993; Brown et al., 1995).] The difference between subjects 5a and 5b, who received the same set of translations, suggests that the individual classifiers' accuracy influences the results so much as to blur the effect of the other parameters. There seems to be a tendency for glossing to work better than Model 4 decoding. Glossing, in our system, is a simple baseline algorithm that provides the most likely word translation for each word of input. Translation candidates and their probabilities are retrieved from the translation table, which is part of the translation model trained on the parallel corpus.</Paragraph>
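As a concrete illustration of the glossing baseline just described, here is a minimal Python sketch; the translation-table format (a source word mapped to a dictionary of candidate translations and their probabilities) is an assumption for illustration, not the actual EGYPT file format:

```python
def gloss(sentence: str, ttable: dict) -> str:
    """Replace each source word by its most probable translation from the
    translation table; words missing from the table are passed through."""
    out = []
    for word in sentence.split():
        candidates = ttable.get(word)  # e.g. {"school": 0.6, "schools": 0.3}
        if candidates:
            out.append(max(candidates, key=candidates.get))
        else:
            out.append(word)  # unknown word: copy it verbatim
    return " ".join(out)
```

Because each word is translated in isolation, glossing ignores word order and context entirely, but it also never drops input words, which, as discussed below, turns out to matter for the information-oriented tasks.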
<Paragraph position="4"> The document classification test is first and foremost a measure of the quality of the translation table for frequently occurring words. In practice, actual classification might be performed by automatic procedures rather than humans. If we dare to accept the top performances of our human subjects as a tentative upper bound on what can be achieved with the current system, using a translation model trained on 85K tokens of Tamil text and the corresponding English translations, we can conclude that classification accuracy can exceed 86% (13/15) for fine-grained classification and reach 100% for coarse-grained classification. However, given the extremely small sample size in this evaluation, the evidence should not be considered conclusive.</Paragraph> <Paragraph position="5"> The document retrieval task and the question answering task (see below) were combined into one task. The subjects received 14 texts (from the TamilNet corpus) and 15 lead questions plus 13 additional follow-up questions. Their task was to identify the document(s) containing the answer to each question and to answer the questions asked. Typical lead questions were What is the security situation in Trincomalee? or Who is S. Thivakarasa?; typical follow-up questions were Who is in control of the situation?, What happened to him/her?, or How did (other) people react to what happened? As in the previous experiment, each subject received the output of a different system.</Paragraph> <Paragraph position="6"> Table 2 shows the results of the document retrieval task. [Table 2 caption fragment: ... containing the answers to 15 lead questions. Black dots indicate successful identification of at least one document containing the answer. Table 2 notes: (unlabeled): same as above; 10 training iterations. g: same as above, trained with pegging option; 10 training iterations. h: Berkeley and TamilNet corpora, raw (unstemmed); 64,439 tokens on Tamil side, 50 training iterations.] Again, the sample size was too small to draw any final conclusions, but our results seem to suggest the following. Firstly, the test subjects in the groups dealing with the output of systems trained on the bigger training set tend to perform better than those dealing with the results of training on less data. This suggests that the jump from 24K to 85K tokens of training data might improve system performance in a significant manner. We were surprised that, even with the poor translation performance of our system, recall as high as 93% at a precision of 88% could be achieved. Secondly, the data shows that gaps are not randomly distributed over the data; some questions clearly were more difficult than others. One particularly difficult aspect of the task was the spelling of names. Question 11, for example, asked What happened to Chandra Kumar Abayasingh? In the translations, however, the name was rendered in simple transliteration: cantirakumaara apayacingka. It requires a considerable degree of tenacity and imagination to find this connection.</Paragraph> <Paragraph position="7"> In order to measure performance on the question answering part of this evaluation, we considered only questions relevant to the documents that the test subjects had identified correctly. Because of the difficulty of the task, we were somewhat lenient in the evaluation. For example, if the correct answer was the former president of the teachers' union and the answer given was an official of the teachers' union, we still counted this as "close enough" and therefore correct. In addition, we also allowed partially correct answers, that is, answers that went in the right direction but were not quite correct. For example, if the correct answer was The army imposed a curfew on fishing, we counted the answer the army is stopping fishing boats as partially correct. All in all, it was very difficult to evaluate this section of the task, because it was often close to impossible to determine whether an answer was an educated guess or actually based on the text. There were some cases where answers were partially or even fully correct even though the correct document had not been identified. In retrospect, we conclude that it would have been better to have the test subjects mark up those text passages that justify their answers.</Paragraph> <Paragraph position="8"> Again, the data suggests that the difference in training corpus size does affect the amount of information that is available from the system output. Subjects working with the output of a system whose translation model was trained on only the TamilNet data tend to perform worse than subjects working with output from a system whose translation model was trained on the larger corpus.
The poor performance on test set No. 6 may suggest that, for this task and at this level of translation quality, glossing provides more informative output than Model 4 decoding. This result is not particularly surprising, since we noticed that Model 4 decoding tends to leave out more words than is acceptable. Clearly, this is one area where the translation model has to be improved.</Paragraph> <Paragraph position="9"> Test set 10 is the only set produced by a system using a translation model trained on raw, unstemmed data. It is unclear whether the poor question answering performance on this test set is due to fundamentally worse translation quality or to the test subject's (lack of) tenacity and willingness to work her way through the system output.</Paragraph> <Paragraph position="10"> All in all, we were astonished by the amount of information that our test subjects were able to retrieve from the material they received (the top recall for the question answering task is 64%, plus an additional 14% partially correct answers!). [Table 3 notes: the sets are the same as in Table 2; only questions concerning documents that were identified correctly were considered in this evaluation.] However, using a system such as the one discussed in this paper is not an option for actual information processing. In particular, those subjects who had to deal with the output of systems trained on the smaller corpus found the task utterly frustrating and would not want to do it again.</Paragraph> </Section> </Section> </Paper>