<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3103"> <Title>Morpho-syntactic Arabic Preprocessing for Arabic-to-English Statistical Machine Translation</Title> <Section position="3" start_page="15" end_page="15" type="intro"> <SectionTitle> 2 Baseline SMT System </SectionTitle> <Paragraph position="0"> In statistical machine translation, we are given a source language sentence fJ1 = f1 ...fj ...fJ, which is to be translated into a target language sentence eI1 = e1 ...ei ...eI. Among all possible target language sentences, we will choose the sentence with the highest probability:</Paragraph> <Paragraph position="2"> The posterior probability Pr(eI1|fJ1 ) is modeled directly using a log-linear combination of several models (Och and Ney, 2002): The denominator represents a normalization factor that depends only on the source sentence fJ1 . Therefore, we can omit it during the search process. As a decision rule, we obtain:</Paragraph> <Paragraph position="4"> This approach is a generalization of the source-channel approach (Brown et al., 1990). It has the advantage that additional models h(*) can be easily integrated into the overall system. The model scaling factors lM1 are trained with respect to the final translation quality measured by an error criterion (Och, 2003).</Paragraph> <Paragraph position="5"> We use a state-of-the-art phrase-based translation system including the following models: an n-gram language model, a phrase translation model and a word-based lexicon model. The latter two models are used for both directions: p(f|e) and p(e|f). Additionally, we use a word penalty and a phrase penalty. More details about the baseline system can be found in (Zens and Ney, 2004; Zens et al., 2005).</Paragraph> </Section> class="xml-element"></Paper>