<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1001"> <Title>Combination of Arabic Preprocessing Schemes for Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> The anecdotal intuition in the field is that reduction of word sparsity often improves translation quality. This reduction can be achieved by increasing training data or via morphologically driven preprocessing (Goldwater and McClosky, 2005).</Paragraph> <Paragraph position="1"> Recent publications on the effect of morphology on SMT quality focused on morphologically rich languages such as German (Niessen and Ney, 2004); Spanish, Catalan, and Serbian (Popović and Ney, 2004); and Czech (Goldwater and McClosky, 2005). They all studied the effects of various kinds of tokenization, lemmatization, and POS tagging, and showed a positive effect on SMT quality.</Paragraph> <Paragraph position="2"> Specifically considering Arabic, Lee (2004) investigated the use of automatic alignment of POS-tagged English and affix-stem segmented Arabic to determine appropriate tokenizations. Her results show that morphological preprocessing helps, but only for the smaller corpora. As size increases, the benefits diminish. Our results are comparable to hers in terms of BLEU score and consistent in terms of conclusions. Other research on preprocessing Arabic suggests that minimal preprocessing, such as splitting off the conjunction w+ 'and', produces the best results with very large training data (Och, 2005).</Paragraph> <Paragraph position="3"> System combination for MT has also been investigated by different researchers. 
Approaches to combination generally either select one of the hypotheses produced by the different systems combined (Nomoto, 2004; Paul et al., 2005; Lee, 2005) or combine lattices/n-best lists from the different systems with different degrees of synthesis or mixing (Frederking and Nirenburg, 1994; Bangalore et al., 2001; Jayaraman and Lavie, 2005; Matusov et al., 2006). These different approaches use various translation and language models in addition to other models such as word matching, sentence and document alignment, system translation confidence, phrase translation lexicons, etc.</Paragraph> <Paragraph position="4"> We extend previous work by experimenting with a wider range of preprocessing schemes for Arabic and exploring their combination to produce better results.</Paragraph> </Section> </Paper>