<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1001"> <Title>Combination of Arabic Preprocessing Schemes for Statistical Machine Translation</Title> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Arabic Linguistic Issues </SectionTitle> <Paragraph position="0"> Arabic is a morphologically complex language with a large set of morphological features.1 These features are realized using both concatenative morphology (affixes and stems) and templatic morphology (roots and patterns). A variety of morphological and phonological adjustments appear in word orthography and interact with orthographic variations. Next we discuss a subset of these issues that are necessary background for the later sections. We do not address derivational morphology (such as using roots as tokens) in this paper. [Footnote 1: Arabic words have fourteen morphological features: POS, person, number, gender, voice, aspect, determiner proclitic, conjunctive proclitic, particle proclitic, pronominal enclitic, nominal case, nunation, idafa (possessed), and mood.]</Paragraph> <Paragraph position="1"> • Orthographic Ambiguity: The form of certain letters in Arabic script allows suboptimal orthographic variants of the same word to coexist in the same text. For example, variants of Hamzated Alif, أ Â or إ Ǎ, are often written without their Hamza (ء): ا A. These variant spellings increase the ambiguity of words. The Arabic script employs diacritics for representing short vowels and doubled consonants. These diacritics are almost always absent in running text, which increases word ambiguity. We assume all of the text we are using is undiacritized.</Paragraph> <Paragraph position="2"> • Clitics: Arabic has a set of attachable clitics to be distinguished from inflectional features such as gender, number, person, voice, aspect, etc. These clitics are written attached to the word and thus increase the ambiguity of alternative readings. We can classify three degrees of cliticization that are applicable to a word base in a strict order:</Paragraph> </Section> <Section position="6" start_page="1" end_page="2" type="metho"> <SectionTitle> [CONJ+ [PART+ [Al+ BASE +PRON]]] </SectionTitle> <Paragraph position="0"> At the deepest level, the BASE can have a definite article (+ال Al+ 'the') or a member of the class of pronominal enclitics, +PRON (e.g. +هم +hm 'their/them'). Pronominal enclitics can attach to nouns (as possessives) or verbs and prepositions (as objects). The definite article does not apply to verbs or prepositions. +PRON and Al+ cannot co-exist on nouns. Next comes the class of particle proclitics (PART+): +ل l+ 'to/for', +ب b+ 'by/with', +ك k+ 'as/such' and +س s+ 'will/future'. b+ and k+ are only nominal; s+ is only verbal; and l+ applies to both nouns and verbs.</Paragraph> <Paragraph position="1"> At the shallowest level of attachment we find the conjunctions (CONJ+) +و w+ 'and' and +ف f+ 'so'. They can attach to everything.</Paragraph>
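<Paragraph> To make the strict nesting order concrete, the sketch below (ours, not part of the paper's pipeline) peels clitics off a Buckwalter-transliterated word from the outside in. It is a hypothetical toy: real decliticization requires morphological disambiguation first (Section 4.1), since a greedy string match will happily split word-initial letters that are not clitics at all.

# Illustrative sketch of the order [CONJ+ [PART+ [Al+ BASE +PRON]]].
# A toy on Buckwalter transliteration, NOT the MADA/BAMA-based technique
# of Section 4.1; the clitic inventories below are deliberately partial.

CONJ = ("w", "f")                    # +w 'and', +f 'so'
PART = ("l", "b", "k", "s")          # 'to/for', 'by/with', 'as/such', 'will'
PRON = ("hm", "hA", "km", "nA", "h", "k", "y")   # some pronominal enclitics

def decliticize(word: str) -> dict:
    """Greedily peel clitics in the strict order CONJ, then PART, then Al/PRON."""
    conj = part = art = pron = None
    if len(word) > 2 and word[0] in CONJ:
        conj, word = word[0] + "+", word[1:]
    if len(word) > 2 and word[0] in PART:
        part, word = word[0] + "+", word[1:]
    if word.startswith("Al") and len(word) > 3:
        art, word = "Al+", word[2:]
    else:                            # Al+ and +PRON cannot co-exist on nouns
        for p in PRON:               # longest enclitics listed first
            if word.endswith(p) and len(word) > len(p) + 1:
                pron, word = "+" + p, word[: -len(p)]
                break
    return {"CONJ": conj, "PART": part, "Al": art, "BASE": word, "PRON": pron}

print(decliticize("wbAlktAb"))   # 'and with the book': w+ b+ Al+ ktAb
print(decliticize("wsyktbhA"))   # 'and he will write it': w+ s+ yktb +hA
</Paragraph>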
<Paragraph position="2"> • Adjustment Rules: Morphological features that are realized concatenatively (as opposed to templatically) are not always simply concatenated to a word base. Additional morphological, phonological and orthographic rules are applied to the word. An example of a morphological rule is the feminine morpheme ة +p (ta marbuta), which can only be word-final. In medial position, it is turned into ت t. For example, مكتبة+هم mktbp+hm appears as مكتبتهم mktbthm 'their library'. An example of an orthographic rule is the deletion of the Alif (ا) of the definite article +ال Al+ in nouns when preceded by the preposition +ل l+ 'to/for', but not with any other prepositional proclitic.</Paragraph> <Paragraph position="3"> • Templatic Inflections: Some of the inflectional features in Arabic words are realized templatically by applying a different pattern to the Arabic root. As a result, extracting the lexeme (or lemma) of an Arabic word is not always an easy task and often requires the use of a morphological analyzer. One common example in Arabic nouns is broken plurals. For example, one of the plural forms of the Arabic word كاتب kAtb 'writer' is كتبة ktbp 'writers'. An alternative non-broken plural (concatenatively derived) is كاتبون kAtbwn 'writers'.</Paragraph> <Paragraph position="4"> These phenomena highlight two issues related to the task at hand (preprocessing). First, ambiguity in Arabic words is an important issue to address. To determine whether a clitic or feature should be split off or abstracted off requires that we determine that said feature is indeed present in the word we are considering in context - not just that it is possible given an analyzer. Second, once a specific analysis is determined, the process of splitting off or abstracting off a feature must be clear on what the form of the resulting word should be. In principle, any adjustments that the removed feature had triggered (and that are now irrelevant) should themselves be undone. This ensures reduced sparsity and reduced unnecessary ambiguity. For example, the word كتبتهم ktbthm has two possible readings (among others) as 'their writers' or 'I wrote them'. Splitting off the pronominal enclitic +هم +hm without normalizing the ت t to ة p in the nominal reading leads to the coexistence of two forms of the noun: كتبة ktbp and كتبت ktbt. This increased sparsity is only worsened by the fact that the second form is also the verbal form (thus increased ambiguity).</Paragraph>
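<Paragraph> The following minimal sketch (ours, with a hypothetical split_enclitic helper) illustrates the normalization just described: when a pronominal enclitic is split off a nominal form, a stem-final ت t that realized ta marbuta must be restored to ة p. It stands in for full regeneration with a morphological generator (Section 4.1), and the POS tag is assumed to come from a disambiguated analysis rather than be guessed.

# Toy normalization after enclitic splitting; the test on a final 't' is an
# oversimplification (not every noun-final t realizes ta marbuta).

def split_enclitic(word: str, enclitic: str, pos: str) -> tuple[str, str]:
    assert word.endswith(enclitic)
    base = word[: -len(enclitic)]
    if pos == "NOUN" and base.endswith("t"):
        base = base[:-1] + "p"   # medial t was a word-final ta marbuta (p)
    return base, "+" + enclitic

# ktbthm is ambiguous: nominal 'their writers' vs. verbal 'I wrote them'.
print(split_enclitic("ktbthm", "hm", pos="NOUN"))  # ('ktbp', '+hm')
print(split_enclitic("ktbthm", "hm", pos="VERB"))  # ('ktbt', '+hm')
</Paragraph>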
</Section> <Section position="7" start_page="2" end_page="4" type="metho"> <SectionTitle> 4 Arabic Preprocessing Schemes </SectionTitle> <Paragraph position="0"> Given Arabic morphological complexity, the number of possible preprocessing schemes is very large, since any subset of morphological and orthographic features can be separated, deleted or normalized in various ways. To implement any preprocessing scheme, a preprocessing technique must be able to disambiguate amongst the possible analyses of a word, identify the features addressed by the scheme in the chosen analysis, and process them as specified by the scheme. In this section we describe eleven different schemes.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Preprocessing Technique </SectionTitle> <Paragraph position="0"> We use the Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter, 2002) to obtain possible word analyses. To select among these analyses, we use the Morphological Analysis and Disambiguation for Arabic (MADA) tool,2 an off-the-shelf resource for Arabic disambiguation (Habash and Rambow, 2005). Being a disambiguation system of morphology, not word sense, MADA sometimes produces ties for analyses with the same inflectional features but different lexemes (resolving such ties requires word-sense disambiguation). We resolve these ties in a consistent, if arbitrary, manner: we take the first analysis in a sorted list of analyses.</Paragraph> <Paragraph position="1"> Producing a preprocessing scheme involves removing features from the word analysis and regenerating the word without the split-off features.</Paragraph> <Paragraph position="2"> The regeneration ensures that the generated form is appropriately normalized by addressing the various morphotactics described in Section 3. The generation is completed using the off-the-shelf Arabic morphological generation system Aragen (Habash, 2004).</Paragraph> <Paragraph position="3"> The preprocessing technique we use here is the best performer amongst the techniques explored in Habash and Sadat (2006).</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.2 Preprocessing Schemes </SectionTitle> <Paragraph position="0"> Table 1 exemplifies the effect of the different schemes on the same sentence.</Paragraph> <Paragraph position="1"> • ST: Simple Tokenization is the baseline preprocessing scheme. It is limited to splitting off punctuation and numbers from words. For example, the last non-white-space string in the example sentence in Table 1, &quot;trkyA.&quot;, is split into two tokens: &quot;trkyA&quot; and &quot;.&quot;. An example of splitting numbers from words is the conjunction +و w+ 'and', which can prefix numerals such as when a list of numbers is described: و15 w15 'and 15'. This scheme requires no disambiguation. Any diacritics that appear in the input are removed in this scheme. This scheme is used as input to produce the other schemes.</Paragraph> <Paragraph position="2"> • ON: Orthographic Normalization addresses the issue of suboptimal spelling in Arabic. We use the Buckwalter answer, undiacritized, as the orthographically normalized form. An example of ON is the spelling of the last letter in the first and fifth words in the example in Table 1 (wsynhY and AlY, respectively). Since orthographic normalization is tied to the use of MADA and BAMA, all of the schemes we use here are normalized.</Paragraph> <Paragraph position="3"> • D1, D2, and D3: Decliticization (degrees 1, 2 and 3) are schemes that split off clitics in the order described in Section 3. D1 splits off the class of conjunction clitics (w+ and f+). D2 is the same as D1 plus splitting off the class of particles (l+, k+, b+ and s+). Finally, D3 splits off what D2 does in addition to the definite article Al+ and all pronominal enclitics. A pronominal clitic is represented by its feature representation to preserve its uniqueness. (See the third word in the example in Table 1.) This allows distinguishing between the possessive pronoun and the object pronoun, which often look similar.</Paragraph> <Paragraph position="4"> • WA: Decliticizing the conjunction w+. This is the simplest tokenization used beyond ON. It is similar to D1, but without including f+. It is included because of evidence supporting it as the best preprocessing scheme for very large data (Och, 2005).</Paragraph> <Paragraph position="5"> • TB: Arabic Treebank Tokenization. This is the same tokenization scheme used in the Arabic Treebank (Maamouri et al., 2004). It is similar to D3 but without the splitting off of the definite article Al+ or the future particle s+.</Paragraph>
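<Paragraph> As a rough illustration of how these decliticization schemes differ (the paper's Table 1 serves this purpose on a real sentence), the hand-constructed examples below show plausible outputs on two hypothetical Buckwalter-transliterated words. The feature-token notation for the split enclitic is our invention, standing in for the feature representation described above.

# Ours, not reproduced from Table 1. 'wbAlktAb' = 'and with the book',
# 'wsyktbhA' = 'and he will write it'. '+O:3FS' is an assumed feature
# token for a 3rd-feminine-singular object enclitic.

examples = {
    "ST": ["wbAlktAb",        "wsyktbhA"],          # no morphological splitting
    "D1": ["w+ bAlktAb",      "w+ syktbhA"],        # conjunctions split off
    "D2": ["w+ b+ AlktAb",    "w+ s+ yktbhA"],      # ... plus particles
    "D3": ["w+ b+ Al+ ktAb",  "w+ s+ yktb +O:3FS"], # ... plus Al+ and enclitics
    "WA": ["w+ bAlktAb",      "w+ syktbhA"],        # w+ only (= D1 here)
    "TB": ["w+ b+ AlktAb",    "w+ syktb +O:3FS"],   # like D3, but no Al+ or s+
}
for scheme, (noun_example, verb_example) in examples.items():
    print(f"{scheme:3} {noun_example:16} {verb_example}")
</Paragraph>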
<Paragraph position="6"> • MR: Morphemes. This scheme breaks up words into stem and affixal morphemes. It is identical to the initial tokenization used by Lee (2004).</Paragraph> <Paragraph position="7"> • L1 and L2: Lexeme and POS. These reduce a word to its lexeme and a POS tag. L1 and L2 differ in the set of POS tags they use. L1 uses the simple POS tags advocated by Habash and Rambow (2005) (15 tags), while L2 uses the reduced tag set used by Diab et al. (2004) (24 tags). The latter is modeled after the English Penn POS tag set. For example, Arabic nouns are differentiated for being singular (NN) or plural/dual (NNS), but adjectives are not, even though, in Arabic, they inflect exactly the same way nouns do.</Paragraph> <Paragraph position="8"> • EN: English-like. This scheme is intended to minimize differences between Arabic and English.</Paragraph> <Paragraph position="9"> It decliticizes similarly to D3, but uses the lexeme and POS tags instead of the regenerated word. The POS tag set used is the reduced Arabic Treebank tag set (24 tags) (Maamouri et al., 2004; Diab et al., 2004). Additionally, the subject inflection is indicated explicitly as a separate token. We do not use any additional information to remove specific features using alignments or syntax (unlike, e.g., removing all but one Al+ in noun phrases (Lee, 2004)).</Paragraph> </Section> <Section position="3" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.3 Comparing Various Schemes </SectionTitle> <Paragraph position="0"> Table 2 compares the different schemes in terms of the number of tokens, the number of out-of-vocabulary (OOV) tokens, and perplexity. These statistics are computed over the MT04 set, which we use in this paper to report SMT results (Section 5). Perplexity is measured against a language model constructed from the Arabic side of the parallel corpus used in the MT experiments (Section 5).</Paragraph> <Paragraph position="1"> Obviously, the more verbose a scheme is, the bigger the number of tokens in the text. ST, ON, L1, and L2 share the same number of tokens because they all modify the word without splitting off any of its morphemes or features. The increase in the number of tokens is inversely correlated with the number of OOVs and with perplexity. The only exceptions are L1 and L2, whose low OOV rate is the result of the reductionist nature of these schemes, which do not preserve morphological information.</Paragraph> </Section> </Section> <Section position="8" start_page="4" end_page="5" type="metho"> <SectionTitle> 5 Basic Scheme Experiments </SectionTitle> <Paragraph position="0"> We now describe the system and the data sets we used to conduct our experiments.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Portage </SectionTitle> <Paragraph position="0"> We use an off-the-shelf phrase-based SMT system, Portage (Sadat et al., 2005). For training, Portage uses IBM word alignment models (models 1 and 2) trained in both directions to extract phrase tables in a manner resembling that of Koehn (2004a). Trigram language models are implemented using the SRILM toolkit (Stolcke, 2002). Decoding weights are optimized using Och's algorithm (Och, 2003) to set the weights of the four components of the log-linear model: language model, phrase translation model, distortion model, and word-length feature. The weights are optimized over the BLEU metric (Papineni et al., 2001). The Portage decoder, Canoe, is a dynamic-programming beam search algorithm resembling the algorithm described in Koehn (2004a).</Paragraph>
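<Paragraph> As a pointer to how such a model combines its components, the sketch below scores a hypothesis under a log-linear model with the four features named above; the feature values and weights are invented for illustration and stand in for Portage's actual components.

# score(e, f) = sum_i lambda_i * h_i(e, f); the hypothesis with the
# highest score wins. The weights lambda_i are tuned for BLEU (Och, 2003).

def loglinear_score(features: dict, weights: dict) -> float:
    return sum(weights[name] * features[name] for name in weights)

weights = {"lm": 0.5, "tm": 0.3, "distortion": 0.1, "word_len": 0.1}   # illustrative
hypothesis = {"lm": -12.4, "tm": -7.9, "distortion": -2.0, "word_len": 9.0}
print(loglinear_score(hypothesis, weights))
</Paragraph>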
</Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.2 Experimental data </SectionTitle> <Paragraph position="0"> All of the training data we use is available from the Linguistic Data Consortium (LDC). We use an Arabic-English parallel corpus of about 5 million words for translation model training data.3 We created the English language model from the English side of the parallel corpus together with 116 million words from the English Gigaword Corpus (LDC2005T12) and 128 million words from the English side of the UN Parallel corpus (LDC2004E13).4 [Footnote 3: eTIRR (LDC2004E72), English translation of the Arabic Treebank (LDC2005E46), and Ummah (LDC2004T18).] [Footnote 4: We selected portions of the additional corpora using a heuristic that picks documents containing the word &quot;Arab&quot; only. The language model created using this heuristic gave a bigger improvement in BLEU score (more than 1% BLEU-4) than a randomly selected portion of equal size.] English preprocessing simply included lowercasing, separating punctuation from words, and splitting off &quot;'s&quot;. The same preprocessing was used on the English data for all experiments. Only the Arabic preprocessing was varied. Decoding weight optimization was done using a set of 200 sentences from the 2003 NIST MT evaluation test set (MT03). We report results on the 2004 NIST MT evaluation test set (MT04). The experiment design and the choices of schemes and techniques were made independently of the test set. The data sets, MT03 and MT04, include one Arabic source and four English reference translations each. We use the evaluation metric BLEU-4 (Papineni et al., 2001), although we are aware of its caveats (Callison-Burch et al., 2006).</Paragraph> </Section> <Section position="3" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.3 Experimental Results </SectionTitle> <Paragraph position="0"> We conducted experiments with all the schemes discussed in Section 4 at different training corpus sizes: 1%, 10%, 50% and 100%. The results of the experiments are summarized in Table 3. These results are not English case sensitive. For 1% training, reported scores must differ by over 1.1% BLEU-4 to be significant at the 95% confidence level; for all other training sizes, the difference must be over 1.7% BLEU-4. Error intervals were computed using bootstrap resampling (Koehn, 2004b).</Paragraph> <Paragraph position="1"> Across the different schemes, EN performs best under the scarce-resource condition, and D2 performs best under large-resource conditions. The results from the learning curve are consistent with previously published work on using morphological preprocessing for SMT: deeper morphological analysis helps for small data sets, but the effect is diminished with more data. One interesting observation is that for our best performing system (D2), the BLEU score at 50% training (35.91) was higher than that of the baseline ST at 100% training (34.59).</Paragraph>
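<Paragraph> The significance thresholds above come from bootstrap resampling. A minimal sketch of the idea, under the assumption that per-sentence statistics are aggregated into a corpus-level score by a metric() function (corpus BLEU-4 in the paper; a simple average in this toy):

import random

def bootstrap_interval(sent_stats, metric, n_samples=1000, alpha=0.05, seed=0):
    """Resample sentences with replacement; return an empirical (1-alpha) CI."""
    rng = random.Random(seed)
    n = len(sent_stats)
    scores = sorted(
        metric([sent_stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    return scores[int(n_samples * alpha / 2)], scores[int(n_samples * (1 - alpha / 2)) - 1]

# Toy usage with fake per-sentence scores; real BLEU aggregates n-gram
# counts and lengths over the resample before computing the corpus score.
fake_stats = [random.Random(i).uniform(20.0, 50.0) for i in range(200)]
print(bootstrap_interval(fake_stats, lambda xs: sum(xs) / len(xs)))
</Paragraph>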
<Paragraph position="2"> This relationship is not consistent across the rest of the experiments. ON improves over the baseline, but only statistically significantly at the 1% training level. The results for WA are generally similar to those for D1. This makes sense, since w+ is by far the more common of the two conjunctions D1 splits off. The TB scheme behaves similarly to D2, the best scheme we have. It outperformed D2 in a few instances, but the differences were not statistically significant. L1 and L2 behaved similarly to EN across the different training sizes; however, both were always worse than EN, and neither variant was consistently better than the other.</Paragraph> </Section> </Section> <Section position="9" start_page="5" end_page="6" type="metho"> <SectionTitle> 6 System Combination </SectionTitle> <Paragraph position="0"> The complementary variation in the behavior of different schemes under different resource-size conditions motivated us to investigate system combination. The intuition is that, even under large-resource conditions, some words will occur so infrequently that the only way to model them is to use a technique that behaves well under poor-resource conditions.</Paragraph> <Paragraph position="1"> We conducted an oracle study into system combination. An oracle combination output was created by selecting, for each input sentence, the output with the highest sentence-level BLEU score. We recognize that since the brevity penalty in BLEU is applied globally, this score may not be the highest possible combination score. The oracle combination has a 24% improvement in BLEU score (from 37.1 for the best single system to 46.0) when combining all eleven schemes described in this paper. This shows that combining the output from all schemes has large potential for improvement over all of the individual systems, and that the different schemes are complementary in some way.</Paragraph>
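<Paragraph> A minimal sketch of the oracle selection just described, assuming a sentence_bleu() scorer (e.g., a smoothed sentence-level BLEU); as noted above, the global brevity penalty means this greedy per-sentence choice is not a true upper bound on corpus BLEU.

def oracle_combine(outputs_by_scheme, references, sentence_bleu):
    """outputs_by_scheme: scheme name -> list of 1-best strings, one per sentence;
    references: list of reference lists; returns the oracle-selected outputs."""
    combined = []
    for i in range(len(references)):
        best = max(outputs_by_scheme,
                   key=lambda s: sentence_bleu(outputs_by_scheme[s][i], references[i]))
        combined.append(outputs_by_scheme[best][i])
    return combined

# Toy scorer (unigram Jaccard overlap) standing in for smoothed sentence BLEU.
def toy_overlap(hyp, refs):
    h = set(hyp.split())
    return max(len(h.intersection(r.split())) / max(len(h.union(r.split())), 1)
               for r in refs)

outs = {"D2": ["the deal was signed"], "TB": ["a deal signed today"]}
print(oracle_combine(outs, [["the deal was signed today"]], toy_overlap))
</Paragraph>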
<Paragraph position="2"> In the rest of this section we describe two successful methods for system combination of different schemes: rescoring-only combination (ROC) and decoding-plus-rescoring combination (DRC). All of the experiments use the same training data, test data (MT04) and preprocessing schemes described in the previous section.</Paragraph> <Section position="1" start_page="5" end_page="6" type="sub_section"> <SectionTitle> 6.1 Rescoring-only Combination </SectionTitle> <Paragraph position="0"> This &quot;shallow&quot; approach rescores all the one-best outputs generated by the separate scheme-specific systems and returns the top choice. Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables, and decoding weights. For rescoring, we use the following features (assembled into a feature vector in the sketch at the end of this subsection):</Paragraph> <Paragraph position="1"> • The four basic features used by the decoder: trigram language model, phrase translation model, distortion model, and word-length feature.</Paragraph> <Paragraph position="2"> • IBM model 1 and IBM model 2 probabilities in both directions. We call the union of these two sets of features standard.</Paragraph> <Paragraph position="3"> • The perplexity of the preprocessed source sentence (PPL) against a source language model, as described in Section 4.3.</Paragraph> <Paragraph position="4"> • The number of out-of-vocabulary words in the preprocessed source sentence (OOV).</Paragraph> <Paragraph position="5"> • The length of the preprocessed source sentence (SL).</Paragraph> <Paragraph position="6"> • An encoding of the specific scheme used (SC). We use a one-hot coding approach with 11 separate binary features, each corresponding to a specific scheme.</Paragraph> <Paragraph position="7"> Optimization of the weights on the rescoring features is carried out using the same max-BLEU algorithm and the same development corpus described in Section 5.</Paragraph> <Paragraph position="8"> Results for different sets of features with the ROC approach are presented in Table 4. Using the standard features with all eleven schemes, we obtain a BLEU score of 34.87 - a significant drop from the best scheme system (D2, 37.10). Using different subsets of features, or limiting the number of systems to the best four (D2, TB, D1 and WA), we get some improvements. The best results are obtained using all schemes with the standard features plus perplexity and scheme coding. The improvements are small; however, they are statistically significant (see Section 6.3).</Paragraph>
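<Paragraph> The following sketch (ours) assembles the rescoring feature vector described above; the numeric values are invented for illustration, and the one-hot scheme encoding (SC) follows the eleven schemes of Section 4.

SCHEMES = ["ST", "ON", "D1", "D2", "D3", "WA", "TB", "MR", "L1", "L2", "EN"]

def roc_features(standard, ppl, oov, slen, scheme):
    """standard: 4 decoder features + 4 IBM model 1/2 scores (both directions);
    appends PPL, OOV, SL and the 11-bit one-hot scheme encoding (SC)."""
    one_hot = [1.0 if s == scheme else 0.0 for s in SCHEMES]
    return list(standard) + [ppl, float(oov), float(slen)] + one_hot

vec = roc_features([-12.4, -7.9, -2.0, 9.0, -31.2, -30.8, -29.9, -28.5],
                   ppl=212.5, oov=2, slen=27, scheme="D2")
print(len(vec))  # 8 standard + 3 + 11 = 22 features
</Paragraph>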
</Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.2 Decoding-plus-Rescoring Combination </SectionTitle> <Paragraph position="0"> This &quot;deep&quot; approach allows the decoder to consult several different phrase tables, each generated using a different preprocessing scheme; just as with ROC, there is a subsequent rescoring stage.</Paragraph> <Paragraph position="1"> A problem with DRC is that the decoder we use can only cope with one format for the source sentence at a time. Thus, we are forced to designate a particular scheme as privileged when the system is carrying out decoding. The privileged preprocessing scheme is the one applied to the source sentence. Obviously, words and phrases in the preprocessed source sentence will match the phrases in the privileged phrase table more frequently than those in the non-privileged ones. Nevertheless, the decoder may still benefit from having access to all the tables. For each choice of privileged scheme, optimization of the log-linear weights is carried out with the version of the development set preprocessed in that same privileged scheme.</Paragraph> <Paragraph position="2"> The middle column of Table 5 shows the results for 1-best output from the decoder under different choices of the privileged scheme. The best-performing system in this column has TB as its privileged preprocessing scheme. The decoder for this system uses TB to preprocess the source sentence, but has access to a log-linear combination of information from all 11 preprocessing schemes.</Paragraph> <Paragraph position="3"> The final column of Table 5 shows the results of rescoring the concatenation of the 1-best outputs from each of the combined systems. The rescoring features used are the same as those used for the ROC experiments. For rescoring, a privileged preprocessing scheme is chosen and applied to the development corpus. We chose TB for this, since it yielded the best result when chosen as the privileged scheme at the decoding stage. Applied to all 11 schemes, this yields the best result so far: 38.67 BLEU. Combining the 4 best preprocessing schemes (D2, TB, D1, WA) yielded a lower BLEU score (37.73). These results show that combining phrase tables from different schemes has a positive effect on MT performance.</Paragraph> </Section> <Section position="3" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.3 Significance Test </SectionTitle> <Paragraph position="0"> We use bootstrap resampling to compute MT statistical significance, as described in Koehn (2004b). The results are presented in Table 6. Comparing the 11 individual systems and the two combinations DRC and ROC shows that DRC is significantly better than the other systems: DRC got the maximum BLEU score in 100% of the samples. When excluding DRC from the comparison set, ROC got the maximum BLEU score in 97.7% of the samples, while D2 and TB got the maximum BLEU score in 2.2% and 0.1% of the samples, respectively. The differences between ROC and both D2 and TB are statistically significant.</Paragraph> </Section> </Section> </Paper>