<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2013"> <Title>Arabic Preprocessing Schemes for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="49" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> The anecdotal intuition in the field is that reducing word sparsity often improves translation quality.</Paragraph> <Paragraph position="1"> This reduction can be achieved by increasing the amount of training data or via morphologically driven preprocessing (Goldwater and McClosky, 2005). Recent publications on the effect of morphology on SMT quality have focused on morphologically rich languages such as German (Niessen and Ney, 2004); Spanish, Catalan, and Serbian (Popović and Ney, 2004); and Czech (Goldwater and McClosky, 2005). They all studied the effects of various kinds of tokenization, lemmatization, and POS tagging and showed a positive effect on SMT quality. (Footnote 2: We conducted several additional experiments that we do not report here for lack of space; we reserve them for a separate technical report.)</Paragraph> <Paragraph position="2"> Specifically considering Arabic, Lee (2004) investigated the use of automatic alignment of POS-tagged English and affix-stem segmented Arabic to determine appropriate tokenizations. Her results show that morphological preprocessing helps, but only for the smaller corpora; as corpus size increases, the benefits diminish. Our results are comparable to hers in terms of BLEU score and consistent in terms of conclusions. We extend previous work by experimenting with a wider range of preprocessing schemes for Arabic, by studying the effect of morphological disambiguation (beyond POS tagging) on preprocessing schemes over learning curves, and by investigating the effect on different genres.</Paragraph> </Section> </Paper>