<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1409">
  <Title>Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?</Title>
  <Section position="5" start_page="2" end_page="2" type="concl">
    <SectionTitle>
4 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have reported on our experience with rapidly building a statistical MT system from scratch. Within ca. 140 translator hours, we were able to create a parallel corpus of about 1300 sentence pairs with 24,000 tokens on the Tamil side, at an average translation rate of approximately 170 Tamil words per hour.</Paragraph>
    <Paragraph position="1"> Very clearly, the effort needed to create parallel data is one of the biggest obstacles to the rapid development of statistical MT systems for new languages.</Paragraph>
    <Paragraph position="2"> With the output of a system which uses a translation model trained on the small amount of parallel data that we created during the course of our experiment, human test subjects achieved a recall of over 50% on the document retrieval task but generally performed poorly on question answering (less than 20%).</Paragraph>
    <Paragraph position="3"> Adding a further corpus of 3,800 sentence pairs allowed us to estimate the benefits of increasing the overall corpus size by roughly 300%.</Paragraph>
    <Paragraph position="4"> Based on our experience with translating the TamilNet corpus, this effort would require an additional 450 translator hours and 36 to 48 post-editor hours. With the additional training data, we were able to produce output that raised performance on our evaluation tasks to up to 93% for document retrieval and 64% for question answering.</Paragraph>
    <Paragraph position="5"> With respect to the scenario of &quot;MT in a month&quot;, we can now make the following calculation: If we assume that the average translator translates at a rate of 170 words/hour and is able to spend 6-7 hours per day on actual translation, then a translator can translate about 1000-1200 words per day. In order to translate a corpus of 100,000 words within one month (assuming a five-day work week), we therefore need four to five full-time translators. For this effort, we can expect a translation system whose performance resembles the one shown in our evaluation.</Paragraph>
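The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the system described here: the function name `translators_needed` is hypothetical, and the figure of 22 working days per month is an assumption (the text specifies only a five-day work week).

```python
import math


def translators_needed(corpus_words, words_per_hour, hours_per_day, working_days):
    """Return the number of full-time translators needed, rounded up."""
    words_per_translator = words_per_hour * hours_per_day * working_days
    return math.ceil(corpus_words / words_per_translator)


# Figures from the text: 170 words/hour, 6-7 productive hours per day,
# and a target corpus of 100,000 words.
# Assumption (not in the text): one month = 22 working days.
WORKING_DAYS = 22

for hours in (6, 7):
    daily_words = 170 * hours
    n = translators_needed(100_000, 170, hours, WORKING_DAYS)
    print(f"{hours} h/day: {daily_words} words/day, {n} translators")
```

Under these assumptions the sketch gives 1020 words/day and 5 translators at 6 hours per day, and 1190 words/day and 4 translators at 7 hours per day, which is consistent with the estimate of four to five translators. The result is sensitive to the assumed number of working days.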
    <Paragraph position="6"> This, of course, raises the following questions, which we are only able to ask but not to answer at this point.</Paragraph>
    <Paragraph position="7"> • Can the translation model and the algorithms for statistical training be improved so that they require less data to produce acceptable results?
 • Are there more efficient uses of scarce resources (such as language experts and translators) for building a statistical (or any other) MT system quickly, for example by creating less but more informative data (e.g. a parallel corpus with word-level alignments) or by compiling a glossary/dictionary of the most frequently used terms?
 • How do the various approaches compare with respect to the ratio of construction effort versus performance improvement when the MT systems are scaled up? One approach may show rapid improvements initially but also reach a plateau quickly, whereas another may show slow but steady improvements.</Paragraph>
    <Paragraph position="8"> • Is there any potential for bootstrapping the resource creation process by using knowledge that can be extracted from little and poor data to speed up the creation of more and better data?
 These are some of the questions that will need to be addressed in future research on Quick MT.</Paragraph>
  </Section>
</Paper>