File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/n06-2013_abstr.xml

Size: 856 bytes

Last Modified: 2025-10-06 13:44:56

<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2013">
  <Title>Arabic Preprocessing Schemes for Statistical Machine Translation</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.</Paragraph>
  </Section>
</Paper>