<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1032">
<Title>Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases</Title>
<Section position="8" start_page="260" end_page="261" type="concl">
<SectionTitle> 7 Discussion </SectionTitle>
<Paragraph position="0"> This paper has presented a super-efficient data structure for phrase-based statistical machine translation.</Paragraph>
<Paragraph position="1"> We have shown that current table-based methods are unwieldy when used in conjunction with large data sets and long phrases. We have contrasted this with our suffix array-based data structure, which provides a very compact way of storing large data sets while simultaneously allowing the retrieval of arbitrarily long phrases.</Paragraph>
<Paragraph position="2"> For the NIST-2004 Arabic-English data set, which is among the largest currently assembled for statistical machine translation, our representation uses a very manageable 2 gigabytes of memory. This is less than is needed to store a table containing phrases with a maximum length of three words, and is ten times less than the memory required to store a table with phrases of length eight.</Paragraph>
<Paragraph position="3"> We have further demonstrated that while computational complexity can make the retrieval of translations of frequent phrases slow, sampling is an extremely effective countermeasure.</Paragraph>
<Paragraph position="4"> We demonstrated that calculating phrase translation probabilities from sets of 100 occurrences or fewer results in almost no decrease in translation quality. The implications of the data structure presented in this paper are significant. The compact representation will allow us to easily scale to parallel corpora consisting of billions of words of text, and the retrieval of arbitrarily long phrases will allow experiments with alternative decoding strategies.
In combination, these properties allow for even greater exploitation of training data in statistical machine translation.</Paragraph>
</Section>
</Paper>
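The retrieval-and-sampling scheme summarized in this section can be sketched roughly as follows. This is a toy Python illustration, not the authors' implementation: the corpus is invented, the suffix-array construction is naive rather than the efficient build a real system would use, and `translate_at` is a hypothetical callback standing in for the alignment-based translation extraction that the paper describes elsewhere.

```python
import random

def build_suffix_array(tokens):
    """Index every suffix of the token corpus, sorted lexicographically.
    (A naive O(n^2 log n) construction for illustration only.)"""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, phrase):
    """Binary-search the suffix array for all start positions of `phrase`.
    Works for phrases of arbitrary length, with no precomputed phrase table."""
    m = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound: first suffix >= phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                       # upper bound: first suffix > phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(first, lo)]

def sample_translation_probs(positions, translate_at, k=100):
    """Estimate p(target | source phrase) from at most k sampled occurrences,
    mirroring the observation that ~100 samples cost almost no quality.
    `translate_at` is a hypothetical hook returning the aligned target
    phrase for one occurrence (alignment extraction is out of scope here)."""
    if len(positions) > k:
        positions = random.sample(positions, k)
    counts = {}
    for p in positions:
        t = translate_at(p)
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}
```

The binary searches locate every occurrence of a phrase of any length in O(|phrase| log n) comparisons, which is what makes arbitrarily long phrases affordable; capping probability estimation at k sampled occurrences is what keeps very frequent phrases from dominating retrieval time.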