<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3103">
  <Title>Morpho-syntactic Arabic Preprocessing for Arabic-to-English Statistical Machine Translation</Title>
  <Section position="4" start_page="15" end_page="16" type="metho">
    <SectionTitle>
3 Preprocessing and Normalization Tools
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
3.1 Tokenizer
</SectionTitle>
      <Paragraph position="0"> As for other languages, the corpora must be first tokenized. Here words and punctuations (except abbreviation) must be separated. Another criterion is that Arabic has some characters that appear only at the end of a word. We use this criterion to separate words that are wrongly attached to each other.</Paragraph>
    </Section>
    <Section position="2" start_page="15" end_page="16" type="sub_section">
      <SectionTitle>
3.2 Normalization and Simplification
</SectionTitle>
      <Paragraph position="0"> The Arabic written language does not contain vowels, instead diacritics are used to define the pronunciation of a word, where a diacritic is written under or above each character in the word. Usually these diacritics are omitted, which increases the ambiguity of a word. In this case, resolving the ambiguity of a word is only dependent on the context. Sometimes, the authors write a diacritic on a word to help the reader and give him a hint which word is really meant. As a result, a single word with the same meaning can be written in different ways. For example $Eb (char49char2e charaa char11char83) can be read1 as sha'ab (Eng. nation) or sho'ab (Eng. options). If the author wants to give the reader a hint that the second word is meant, he 1There are other possible pronunciations for the word $Eb than the two mentioned.</Paragraph>
      <Paragraph position="1">  can write $uEb (char49char2e charaa char0cchar11char83) or $uEab (char49char2e char0bcharaa char0cchar11char83). To avoid this problem we normalize the text by removing all diacritics.</Paragraph>
      <Paragraph position="2"> After segmenting the text, the size of the sentences increases rapidly, where the number of the stripped article Al is very high. Not every article in an Arabic sentence matches to an article in the target language. One of the reasons is that the adjective in Arabic gets an article if the word it describes is definite. So, if a word has the prefix Al, then its adjective will also have Al as a prefix. In order to reduce the sentence size we decide to remove all these articles that are supposed to be attached to an adjective. Another way for determiner deletion is described in (Lee, 2004).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="16" end_page="18" type="metho">
    <SectionTitle>
4 Word Segmentation
</SectionTitle>
    <Paragraph position="0"> One way to simplify inflected Arabic text for a SMT system is to split the words in prefixes, stem and suffixes. In (Lee et al., 2003), (Diab et al., 2004) and (Habash and Rambow, 2005) three supervised segmentation methods are introduced. However, in these works the impact of the segmentation on the translation quality is not studied. In the next subsections we will shortly describe the method of (Diab et al., 2004). Then we present our unsupervised methods. null</Paragraph>
    <Section position="1" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
4.1 Supervised Learning Approach (SL)
</SectionTitle>
      <Paragraph position="0"> (Diab et al., 2004) propose solutions to word segmentation and POS Tagging of Arabic text. For the purpose of training the Arabic TreeBank is used, which is an Arabic corpus containing news articles of the newswire agency AFP. In the first step the text must be transliterated to the Buckwalter transliteration, which is a one-to-one mapping to ASCII characters. In the second step it will be segmented and tokenized. In the third step a partial lemmatization is done. Finally a POS tagging is performed. We will test the impact of the step 3 (segmentation + lemmatization) on the translation quality using our phrase based system described in Section 2.</Paragraph>
    </Section>
    <Section position="2" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
4.2 Frequency-Based Approach (FB)
</SectionTitle>
      <Paragraph position="0"> We provide a set of all prefixes and suffixes and their possible combinations. Based on this set, we may have different splitting points for a given compound word. We decide whether and where to split the composite word based on the frequency of different resulting stems and on the frequency of the compound word, e.g. if the compound word has a higher frequency than all possible stems, it will not be split. This simple heuristic harmonizes the corpus by reducing the size of vocabulary, singletons and also unseen words from the test corpus. This method is very similar to the method used for splitting German compound words (Koehn and Knight, 2003).</Paragraph>
    </Section>
    <Section position="3" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
4.3 Finite State Automaton-Based Approach (FSA)
</SectionTitle>
      <Paragraph position="0"> (FSA) To segment Arabic words into prefixes, stem and one suffix, we implemented two finite state automata. One for stripping the prefixes and the other for the suffixes. Then, we append the suffix automaton to the other one for stripping prefixes. Figure 1 shows the finite state automaton for stripping all possible prefix combinations. We add the prefix s (char80), which changes the verb tense to the future, to the set of prefixes which must be stripped (see table 1). This prefix can only be combined with w and f. Our motivation is that the future tense in English is built by adding the separate word &amp;quot;will&amp;quot;.</Paragraph>
      <Paragraph position="1"> The automaton showed in Figure 1 consists of the following states:  the resulting stem exists already in the text.</Paragraph>
      <Paragraph position="2"> * WF: is achieved if the word begins with w or f. * And the states , K, L, B and AL are achieved if the word begins with s, k, l, b and Al, respectively. null To minimize the number of wrong segmentations, we restricted the transition from one state to the other to the condition that the produced stem occurs at least one time in the corpus. To ensure that most compound words are recognized and segmented, we run the segmenter itteratively, where after each iteration the newly generated words are added to the vocabulary. This will enable recognizing new compound words in the next iteration. Experiments showed that running the segmenter twice is sufficient and in higher iterations most of the added segmentations are wrong.</Paragraph>
    </Section>
    <Section position="4" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
4.4 Improved Finite State Automaton-Based
Approach (IFSA)
</SectionTitle>
      <Paragraph position="0"> Although we restricted the finite state segmenter in such a way that words will be segmented only if the yielded stem already exists in the corpus, we still get some wrongly segmented words. Thus, some new stems, which do not make sense in Arabic, occur in the segmented text. Another problem is that the finite state segmenter does not care about ambiguities and splits everything it recognizes. For example let us examine the word frd (char58char51char09charaf). In one case, the character f is an original one and therefore can not be segmented. In this case the word means &amp;quot;person&amp;quot;. In the other case, the word can be segmented to &amp;quot;f rd&amp;quot; (which means &amp;quot;and then he answers&amp;quot; or &amp;quot;and then an answer&amp;quot;). If the words Alfrd, frd and rd(char58char51char09charaf char2cchar58char51char09charaecharcbchar40 and char58char50) occur in the corpus, then the finite state segmenter will transform the Alfrd (which means &amp;quot;the person&amp;quot;) to Al f rd (which can be translated to &amp;quot;the and then he answers&amp;quot;). Thus the meaning of the original word is distorted. To solve all these problems, we improved the last approach in a way that prefixes and suffixes are recognized simultaneously. The segmentation of the ambiguous word will be avoided. In doing that, we intend to postpone resolving such ambiguities to our SMT system.</Paragraph>
      <Paragraph position="1"> The question now is how can we avoid the segmentation of ambiguous words. To do this, it is sufficient to find a word that contains the prefix as an original character. In the last example the word Alfrd contains the prefix f as an original character and therefore only Al can be stripped off the word. The next question we can ask is, how can we decide if a character belongs to the word or is a prefix. We can extract this information using the invalid prefix combinations. For example Al is always the last prefix that can occur. Therefore all characters that occur in a word after Al are original characters. This method can be applied for all invalid combinations to extract new rules to decide whether a character in a word is an original one or not.</Paragraph>
      <Paragraph position="2"> On the other side, all suffixes we handle in this work are pronouns. Therefore it is not possible to combine them as a suffix. We use this fact to make a decision whether the end characters in a word are original or can be stripped. For example the word trkhm (chard1chareacharbbchar51char10char4b) means &amp;quot;he lets them&amp;quot;. If we suppose that hm is a suffix and therefore must be stripped, then we can conclude that k is an original character and not a suffix. In this way we are able to extract from the corpus itself decisions whether and how a word can be segmented.</Paragraph>
      <Paragraph position="3"> In order to implement these changes the original automaton was modified. Instead of splitting a word we mark it with some properties which corespond to the states traversed untill the end state. On the  other side, we use the technique described above to generate negative properties which avoid the corresponding kind of splitting. If a property and its negation belong to the same word then the property is removed and only the negation is considered. At the end each word is split corresponding to the properties it is marked with.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="18" end_page="19" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
5.1 Corpus Statistics
</SectionTitle>
      <Paragraph position="0"> The experiments were carried out on two tasks: the corpora of the Arabic-English NIST task, which contain news articles and UN reports, and the Arabic-English corpus of the Basic Travel Expression Corpus (BTEC) task, which consists of typical travel domain phrases (Takezawa et al., 2002).</Paragraph>
      <Paragraph position="1"> The corpus statistics of the NIST and BTEC corpora are shown in Table 3 and 5. The statistics of the news part of NIST corpus, consisting of the Ummah, ATB, ANEWS1 and eTIRR corpora, is shown in Table 4. In the NIST task, we make use of the NIST 2002 evaluation set as a development set and NIST 2004 evaluation set as a test set. Because the test set contains four references for each senence we decided to use only the first four references of the development set for the optimization and evaluation.</Paragraph>
      <Paragraph position="2"> In the BTEC task, C-Star'03 and IWSLT'04 copora are considered as development and test sets, respectively. null</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
5.2 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The commonly used criteria to evaluate the translation results in the machine translation community are: WER (word error rate), PER (positionindependent word error rate), BLEU (Papineni et al., 2002), and NIST (Doddington, 2002). The four criteria are computed with respect to multiple references. The number of reference translations per source sentence varies from 4 to 16 references. The evaluation is case-insensitive for BTEC and case-sensitive for NIST task. As the BLEU and NIST scores measure accuracy, higher scores are better.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
5.3 Translation Results
</SectionTitle>
      <Paragraph position="0"> To study the impact of different segmentation methods on the translation quality, we apply different word segmentation methods to the Arabic part of the BTEC and NIST corpora. Then, we make use of the phrase-based machine translation system to translate the development and test sets for each task.</Paragraph>
      <Paragraph position="1"> First, we discuss the experimental results on the BTEC task. In Table 6, the translation results on the BTEC corpus are shown. The first row of the table is the baseline system where none of the segmentation methods is used. All segmentation methods improve the baseline system, except the SL segmentation method on the development corpus. The best performing segmentation method is IFSA which generates the best translation results based on all evaluation criteria, and it is consistent over both development and evaluation sets. As we see, the segmentation of Arabic words has a noticeable impact in improving the translation quality on a small corpus.</Paragraph>
      <Paragraph position="2"> To study the impact of word segmentation methods on a large task, we conduct two sets of experiments on the NIST task using two different amounts of the training corpus: only news corpora, and full corpus. In Table 7, the translation results on the NIST task are shown when just the news corpora were used to train the machine translation models.</Paragraph>
      <Paragraph position="3"> As the results show, except for the FB method, all segmentation methods improve the baseline system.</Paragraph>
      <Paragraph position="4"> For the NIST task, the SL method outperforms the other segmentation methods, while it did not achieve good results when comparing to the other methods in the BTEC task.</Paragraph>
      <Paragraph position="5"> We see that the SL, FSA and IFSA segmentation methods consistently improve the translation results in the BTEC and NIST tasks, but the FB method failed on the NIST task, which has a larger training corpus . The next step is to study the impact of the segmentation methods on a very large task, the NIST full corpus. Unfortunately, the SL method failed on segmenting the large UN corpus, due to the large processing time that it needs. Due to the negative results of the FB method on the NIST news corpora, and very similar results for FSA and IFSA, we were interested to test the impact of IFSA on the NIST full corpus. In Table 8, the translation results of the baseline system and IFSA segmentation method for the NIST full corpus are depicted. As it is shown in table, the IFSA method slightly improves the translation results in the development and test sets.</Paragraph>
      <Paragraph position="6"> The IFSA segmentation method generates the best results among our proposed methods. It acheives consistent improvements in all three tasks over the baseline system. It also outperforms the SL  segmentation on the BTEC task.</Paragraph>
      <Paragraph position="7"> Although the SL method outperforms the IFSA method on the NIST tasks, the IFSA segmentation method has a few notable advantages over the SL system. First, it is consistent in improving the base-line system over the three tasks. But, the SL method failed in improving the BTEC development corpus.</Paragraph>
      <Paragraph position="8"> Second, it is fast and robust, and capable of being applied to the large corpora. Finally, it employs an unsupervised learning method, therefore can easily cope with a new task or corpus.</Paragraph>
      <Paragraph position="9"> We observe that the relative improvement over the baseline system is decreased by increasing the size of the training corpus. This is a natural effect of increasing the size of the training corpus. As the larger corpus provides higher probability to have more samples per word, this means higher chance to learn the translation of a word in different contexts. Therefore, larger training corpus makes a better translation system, i.e. a better baseline, then it would be harder to outperform this better system.</Paragraph>
      <Paragraph position="10"> Using the same reasoning, we can realize why the FB method achieves good results on the BTEC task, but not on the NIST task. By increasing the size of the training corpus, the FB method tends to segment words more than the IFSA method. This over-segmentation can be compensated by using longer phrases during the translation, in order to consider the same context compared to the non-segmented corpus. Then, it would be harder for a phrase-based machine translation system to learn the translation of a word (stem) in different contexts.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="19" end_page="20" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We presented three methods to segment Arabic words: a supervised learning approach, a frequency-</Paragraph>
  </Section>
  <Section position="8" start_page="20" end_page="20" type="metho">
    <SectionTitle>
ARABIC ENGLISH
TOKENIZED IFSA
</SectionTitle>
    <Paragraph position="0"> based approach and a finite state automaton-based approach. We explained that the best of our proposed methods, the improved finite state automaton, has three advantages over the state-of-the-art Arabic word segmentation method (Diab, 2000), supervised learning. They are: consistency in improving the baselines system over different tasks, its capability to be efficiently applied on the large corpora, and its ability to cope with different tasks.</Paragraph>
  </Section>
class="xml-element"></Paper>