<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2051"> <Title>Bridging the Inflection Morphology Gap for Arabic Statistical Machine Translation</Title> <Section position="5" start_page="201" end_page="201" type="metho"> <SectionTitle> 2 Arabic Morphology in Recent Work </SectionTitle> <Paragraph position="0"> Arabic-to-English machine translation exemplifies some of the issues caused by the inflection gap. See (Buckwalter, 2005) and (Larkey et al., 2002) for examples that illustrate morphological inflection of a simple Modern Standard Arabic (MSA) word and the basic stemming operations that we use as our baseline system.</Paragraph> <Paragraph position="1"> (Niessen and Ney, 2000) tackle the inflection gap for German-to-English word alignment by performing a series of morphological operations on the German text. They fragment words based on a full morphological analysis of the sentence, but must use domain-specific, hand-written rules to resolve ambiguous fragmentations. (Niessen and Ney, 2004) also extend the corpus by annotating each source word with morphological information and building a hierarchical lexicon. Their experimental results show dramatic improvements from sentence-level restructuring (question inversion, separated verb prefixes, and phrase merging), but only limited improvement from the hierarchical lexicon, especially as the size of the training data increases.</Paragraph> <Paragraph position="2"> We conduct our morphological analysis at the word level, using the Buckwalter Arabic Morphological Analyzer (BAMA) version 2.0 (Buckwalter, 2004).</Paragraph> <Paragraph position="3"> BAMA analyzes a given surface word, returning a set of potential segmentations (on the order of a dozen) of the source word into prefixes, stems, and suffixes.</Paragraph> <Paragraph position="4"> Our techniques select the appropriate splitting from that set by taking into account the target sides (full sentences) of that word's occurrences in the training corpus. 
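As a rough illustration of the analyzer interface we assume, the following sketch shows the kind of output a BAMA-style tool produces: each surface word maps to a handful of candidate prefix/stem/suffix splittings, each fragment paired with English glosses from the lexicon. The surface form, fragments, and glosses below are invented for this sketch, not actual BAMA lexicon entries.

```python
# Toy sketch of a BAMA-style analysis: each surface word yields several
# candidate splittings into fragments (prefix, stem, suffix), each splitting
# paired with the English glosses of its fragments.
# All entries below are invented for illustration.

def analyze(word):
    """Return candidate (fragments, glosses) pairs for a surface word."""
    toy_analyses = {
        "wktAbhm": [  # hypothetical surface form
            (["w+", "ktAb", "+hm"], ["and", "book", "their"]),
            (["w+", "ktAbhm"], ["and", "letter"]),
        ],
    }
    # Unanalyzable words fall back to a single whole-word "splitting".
    return toy_analyses.get(word, [([word], [])])

print(len(analyze("wktAbhm")))  # 2 candidate splittings
```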
We now describe each splitting technique that we apply.</Paragraph> <Section position="1" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 2.1 BAMA: Simple fragment splitting </SectionTitle> <Paragraph position="0"> We begin by simply replacing each Arabic word with the fragments of the first splitting returned by the BAMA tool. BAMA uses simple word-based heuristics to rank the splitting alternatives.</Paragraph> </Section> <Section position="2" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 2.2 CONTEXT: Single Sense selection </SectionTitle> <Paragraph position="0"> In the CONTEXT step, we take advantage of the gloss information provided in BAMA's lexicon.</Paragraph> <Paragraph position="1"> Each potential splitting corresponds to a particular choice of prefix, stem, and suffix, all of which exist in the manually constructed lexicon along with a set of possible translations (glosses) for each fragment.</Paragraph> <Paragraph position="2"> We select the fragmentation (choice of splitting for the source word) whose corresponding glosses have the most target-side matches in the parallel translation (of the full sentence). The choice of fragmentation is saved and used for all occurrences of the surface-form word in training and testing, introducing context sensitivity without requiring a parser. For test words whose surface form was not seen during training, we simply segment them using the first alternative returned by the BAMA tool. This allows us to translate an unseen test word correctly even though its surface form never occurred in the training data.</Paragraph> </Section> <Section position="3" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 2.3 CORRMATCH: Correspondence matching </SectionTitle> <Paragraph position="0"> The Arabic language often encodes linguistic information within the surface word form that is not present in English. 
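A minimal sketch of the gloss-matching selection in the CONTEXT step of Section 2.2: for each candidate splitting we count how many of its glosses appear on the English side of the sentence pair, and keep the candidate with the most matches. The candidate splittings and glosses below are invented for illustration; tokenization and matching are deliberately simplistic.

```python
# Sketch of the CONTEXT step: pick the splitting whose lexicon glosses
# have the most matches in the English side of the parallel sentence.
# Candidates and glosses are invented stand-ins, not real BAMA output.

def select_splitting(candidates, english_sentence):
    """candidates: list of (fragments, glosses) pairs;
    english_sentence: the target-side sentence as a string.
    Returns the fragments of the candidate with the most gloss matches."""
    target_words = set(english_sentence.lower().split())

    def matches(candidate):
        _, glosses = candidate
        return sum(1 for g in glosses if g.lower() in target_words)

    best_fragments, _ = max(candidates, key=matches)
    return best_fragments

# Two hypothetical candidate splittings for one Arabic surface word.
candidates = [
    (["w+", "ktAb", "+hm"], ["and", "book", "their"]),  # 3 gloss matches
    (["w+", "ktAbhm"], ["and", "letter"]),              # 1 gloss match
]
print(select_splitting(candidates, "and their book is on the table"))
```

In the full technique, the winning fragmentation would then be cached and reused for every later occurrence of that surface form.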
Word fragments that represent this missing information are misleading in the translation process unless they are explicitly aligned to the NULL word on the target side. In this step we explicitly remove fragments that correspond to lexical information not represented in English. While (Lee, 2004) builds part-of-speech models to recognize such elements, we use the fact that their corresponding English translations in the BAMA lexicon are empty. Examples of such fragments are case and gender markers. As an example of CORRMATCH removal, we present the Arabic sentence &quot;h'*A lA ya zAl u gayor naZiyf&quot; (after BAMA only), which becomes &quot;h'*A lA ya zAl gayor naZiyf&quot; after the CORRMATCH stage; the &quot;u&quot; has been removed.</Paragraph> </Section> </Section> <Section position="6" start_page="201" end_page="202" type="metho"> <SectionTitle> 3 Experimental Framework </SectionTitle> <Paragraph position="0"> We evaluate the impact of inflectional splitting on the BTEC (Takezawa et al., 2002) IWSLT05 Arabic language data track. The &quot;Supplied&quot; data track includes a training set of 20K Arabic/English sentence pairs, as well as a development set (&quot;DevSet&quot;) and a test set (&quot;Test05&quot;) of 500 Arabic sentences each, with 16 reference translations per Arabic sentence. Details regarding the IWSLT evaluation criteria and the data topics and collection methods are available in (Eck and Hori, 2005). Due to considerations noted by (Josep M. Crego, 2005) regarding the similarity of the development and test data sets, we also evaluate on test and development data randomly sampled from the complete supplied development and test data.</Paragraph> <Section position="1" start_page="202" end_page="202" type="sub_section"> <SectionTitle> 3.1 System description </SectionTitle> <Paragraph position="0"> Translation experiments were conducted using the (Vogel et al., 2003) system with reordering and future cost estimation. 
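The CORRMATCH removal of Section 2.3 can be sketched as a filter that drops any fragment whose English gloss list in the lexicon is empty. The sentence below is the paper's own example; the gloss table is an invented stand-in for the BAMA lexicon, with the case marker &quot;u&quot; assigned an empty gloss list.

```python
# Sketch of CORRMATCH: drop fragments whose English gloss list is empty
# (e.g. case and gender markers), since they carry no target-side content.
# The gloss table is invented; the example sentence is from the paper.

def corrmatch(fragments, glosses):
    """glosses: dict mapping fragment -> list of English glosses."""
    return [f for f in fragments if glosses.get(f, [])]

glosses = {"h'*A": ["this"], "lA": ["not"], "ya": ["he"],
           "zAl": ["cease"], "u": [],  # "u": case marker, empty gloss list
           "gayor": ["other-than"], "naZiyf": ["clean"]}
sentence = "h'*A lA ya zAl u gayor naZiyf".split()
print(" ".join(corrmatch(sentence, glosses)))
# -> h'*A lA ya zAl gayor naZiyf
```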
We trained translation parameters for 10 scores (language model, word and phrase counts, and six translation model scores from (Vogel, 2005)) with Minimum Error Rate training on the development set. We optimized separately for the NIST (Doddington, 2002) and BLEU (Papineni et al., 2002) metrics.</Paragraph> </Section> </Section> <Section position="7" start_page="202" end_page="202" type="metho"> <SectionTitle> 4 Translation Results </SectionTitle> <Paragraph position="0"> Tables 1 and 2 show the results of each stage of inflectional splitting on the BLEU and NIST metrics. Basic orthographic normalization (merging all Alif, ta marbuta, and ee forms to the base form) serves as a baseline. The test set NIST scores show steady improvements of up to 5 percent relative as more sophisticated splitting techniques are used, i.e., BAMA+CONTEXT+CORRMATCH.</Paragraph> <Paragraph position="1"> These improvements over the baseline are statistically significant in both metrics, as measured by the techniques in (Zhang and Vogel, 2004).</Paragraph> <Paragraph position="2"> Our NIST results for all the final stages of inflectional splitting would place us above the top NIST scores from the IWSLT evaluation on the supplied test set.</Paragraph> <Paragraph position="3"> On both DevSet/Test05 and the randomly split data, we see more dramatic improvements in the NIST scores than in BLEU. This might be due to the NIST metric's sensitivity to correctly translating certain high-gain words in the test corpus. Inflectional splitting techniques that cause previously unknown surface-form words to be translated correctly after splitting can significantly impact the overall score.</Paragraph> </Section></Paper>