<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0506">
  <Title>Building a Shallow Arabic Morphological Analyzer in One Day</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The Statistical Approach
</SectionTitle>
    <Paragraph position="0"> Goldsmith proposed an unsupervised learning tool for automatic morphology called AutoMorphology [14].</Paragraph>
    <Paragraph position="1"> This system is advantageous because it learns prefixes, suffixes, and patterns from a corpus or word-list in the target language without any need for human intervention. However, such a system would not be effective in Arabic morphology, because it does not address the issues of infixation, and would not detect uncommon prefixes and suffixes.</Paragraph>
    <Paragraph position="2"> 3. The Hybrid Approach: This approach uses rules in conjunction with statistics. It employs a list of prefixes, a list of suffixes, and templates to transform a stem into a root. Possible prefix-suffix-template combinations are constructed for a word to derive the possible roots. RDI's system, called MORPHO3, utilizes such a model [8]. Although such systems achieve broader morphological coverage of the Arabic language, manual derivation of rules is laborious and time-consuming, and it requires a good knowledge of Arabic orthographic and morphotactic rules. In fact, MORPHO3 took 3 man-years to build [8]. Large-scale morphological analyzers provide more information than just the root of a word.</Paragraph>
    <Paragraph position="3"> They may provide information such as the meaning of prefixes and suffixes and may perform root disambiguation [8] [10] [11].</Paragraph>
    <Paragraph position="4"> However, this paper is concerned with morphological analysis for the purpose of IR.</Paragraph>
    <Paragraph position="5"> Arabic IR is enhanced when the roots are used in indexing and searching [3] [4] [5].</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 System Description
</SectionTitle>
    <Paragraph position="0"> Sebawai, the system discussed here, is similar to the hybrid approach used by RDI's MORPHO3 [8]. However, this system does not require manually constructed lists of rules and affixes.</Paragraph>
    <Paragraph position="1"> Instead, the system replaces the manual processing with automatic processing.</Paragraph>
    <Paragraph position="2"> The system has two main modules. The first utilizes a list of Arabic word-root pairs (1) to derive a list of prefixes and suffixes, (2) to construct stem templates, and (3) to compute the likelihood that a prefix, a suffix, or a template would appear. The second accepts Arabic words as input, attempts to construct possible prefix-suffix-template combinations, and outputs a ranked list of possible roots.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Getting a list of Word-Root Pairs
</SectionTitle>
      <Paragraph position="0"> The list of word-root pairs may be constructed either manually, using a dictionary, or by using a pre-existing morphological analyzer such as ALPNET or MORPHO3 [8] [10].</Paragraph>
      <Paragraph position="1">  1. Manual construction of word-root pair list: Building a list of several thousand pairs manually is time-consuming, but feasible.</Paragraph>
      <Paragraph position="2"> Assuming that a person who knows Arabic can generate a root for a word every 5 seconds, the manual process would require about 14 hours of work to produce 10,000 word-root pairs.</Paragraph>
      <Paragraph position="3"> 2. Automatic construction of a list using dictionary parsing: Extracting word-root pairs from an electronic dictionary is a feasible process. Since Arabic words are looked up in a dictionary using their root form, an electronic dictionary such as Lisan Al-Arab may be parsed to generate the desired list. However, care should be taken to discard dictionary examples and words unrelated to the root.</Paragraph>
      <Paragraph position="4"> 3. Automatic construction using a pre-existing morphological analyzer: This process is simple,  but requires the availability of an analyzer. For the purposes of this paper, the third method was used to construct the list. Two lists of Arabic words were fed to ALPNET (which was the only Arabic morphological analyzer available to the author) and then the output was parsed to generate the word-root pairs. One list was extracted from a corpus of traditional Arabic text, called Zad, owned by Al-Areeb Electronic Publishers [15].</Paragraph>
      <Paragraph position="5"> The list contains 9,606 words that ALPNET was able to analyze successfully. The original list was larger, but the words that ALPNET was unable to analyze were excluded. The other list was extracted from the LDC Arabic collection (LDC2001T55) containing AFP news-wire stories [16]. This list contains 560,000 words. Of the 560,000 words, ALPNET was able to analyze 270,000 words successfully. The rest of the words (about 290,000) were used for evaluating Sebawai.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Training
</SectionTitle>
      <Paragraph position="0"> As stated above, this module takes a word-root pair as input. By comparing the word to the root, the system determines the prefix, suffix, and stem template. For example, given the pair ("wktAbhm", "ktb"), the system generates "w" as the prefix, "hm" as the suffix, and "CCAC" as the stem template (C's represent the letters in the root). The system increases the number of occurrences of the prefix "w", the suffix "hm", and the template "CCAC" by one. The system takes into account the cases where there are no prefixes or suffixes and denotes either of them with the symbol "#".</Paragraph>
      <Paragraph position="1">  After that, the lists of prefixes, suffixes, and templates are read through to assign probabilities to the items on the lists by dividing the occurrence count of each item in each list by the total number of words N. For character strings S1 and S2 and template T, the probabilities being calculated are P(S1) = count(S1 as a prefix) / N, P(S2) = count(S2 as a suffix) / N, and P(T) = count(T as a template) / N.</Paragraph>
      <Paragraph position="3"> Another potential way of calculating the probabilities of prefixes and suffixes is to use the conditional probability that an item appearing at the edge of a word is actually a prefix or suffix. For example, if "w" appeared as the first letter of a word 100 times, and in 70 of those cases it was actually a prefix, then the probability would be 0.70. In other words, the probabilities being calculated for character strings S1 and S2 are P(S1 is a prefix | S1 begins a word) and P(S2 is a suffix | S2 ends a word). Notice that Sebawai's stems are slightly different from standard stems. Standard stem templates may have letters added in the middle and at the beginning. For example, the template "mCCwC" has "m" placed before the root and "w" placed in the middle. Both "m" and "w" are part of the stem template. However, the training module has no prior knowledge of standard stem templates. Therefore, for the template "mCCwC", "m" is simply treated as part of the prefix list and the extracted template is "CCwC".</Paragraph>
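      <Paragraph position="4"> The training step described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Sebawai's actual code: the names split_word and train are hypothetical, root letters are aligned greedily from left to right (the alignment procedure is not specified in the text), and the total N counts only the pairs that could be aligned.
<![CDATA[
```python
from collections import Counter


def split_word(word, root):
    """Return (prefix, suffix, template), or None if the root cannot be aligned.

    Assumption: root letters are matched greedily, left to right.
    """
    positions = []
    start = 0
    for letter in root:
        idx = word.find(letter, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    first, last = positions[0], positions[-1]
    root_positions = set(positions)
    # Root letters become "C"; any other letter inside the stem is kept.
    template = "".join(
        "C" if i in root_positions else word[i] for i in range(first, last + 1)
    )
    prefix = word[:first] or "#"      # "#" marks an empty prefix
    suffix = word[last + 1:] or "#"   # "#" marks an empty suffix
    return prefix, suffix, template


def train(pairs):
    """Count prefixes, suffixes, and templates, then divide by the word total."""
    prefixes, suffixes, templates = Counter(), Counter(), Counter()
    total = 0
    for word, root in pairs:
        parts = split_word(word, root)
        if parts is None:
            continue  # assumption: pairs that cannot be aligned are skipped
        prefix, suffix, template = parts
        prefixes[prefix] += 1
        suffixes[suffix] += 1
        templates[template] += 1
        total += 1

    def to_probabilities(counts):
        return {item: count / total for item, count in counts.items()}

    return (
        to_probabilities(prefixes),
        to_probabilities(suffixes),
        to_probabilities(templates),
    )
```
]]>
 With the pair ("wktAbhm", "ktb") from the example above, split_word yields the prefix "w", the suffix "hm", and the template "CCAC". A word such as "mktwb" paired with "ktb" yields the prefix "m" and the template "CCwC", which mirrors the treatment of "mCCwC" described above.</Paragraph>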
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Root Detection
</SectionTitle>
      <Paragraph position="0"> The detect-root module accepts an Arabic word and attempts to generate prefix-suffix-template combinations. The combinations are produced by progressively removing prefixes and suffixes and then trying to match all the produced stems to a template. For example, for the Arabic word "AymAn" the possible prefixes are "#", "A", and "Ay", and the possible suffixes are "#", "n", and "An".</Paragraph>
      <Paragraph position="1"> (The table of resulting feasible stems is not reproduced here.) The stems that the system deemed not feasible are "AymA" and "ym". Although "AymA" is indeed not feasible, "ym" is actually feasible (it comes from the root "ymm"), but the system did not know how to deal with it. The paper addresses this problem in sub-section 3.4. The possible roots are ordered according to the product of the probability that a prefix S1 would be observed, the probability that a suffix S2 would be observed, and the probability that a template T would be used, that is, P(root) = P(S1) * P(S2) * P(T).</Paragraph>
      <Paragraph position="3"> The probabilities of prefixes, suffixes, and templates are assumed to be independent. The independence assumption is made to simplify the ranking, but it is not necessarily correct, because certain prefix-suffix combinations are not allowed. Using the system requires some smoothing, which will be discussed in the next subsection. The generated roots are compared to a list of 10,000 roots extracted automatically from an electronic copy of Lisan al-Arab to verify their existence in the language [7].</Paragraph>
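      <Paragraph position="4"> The root detection procedure of this subsection can be sketched in Python as follows. This is a hedged illustration only: the names match_template and detect_roots are hypothetical, the probability tables and the root list passed in are toy inputs rather than values from the paper, and keeping the best score per root when several combinations yield the same root is an assumption.
<![CDATA[
```python
def match_template(stem, template):
    """Return the root letters if the stem fits the template, else None."""
    if len(stem) != len(template):
        return None
    root = []
    for stem_letter, template_letter in zip(stem, template):
        if template_letter == "C":
            root.append(stem_letter)          # a root-letter slot
        elif stem_letter != template_letter:
            return None                       # a fixed letter does not match
    return "".join(root)


def detect_roots(word, prefix_p, suffix_p, template_p, valid_roots):
    """Rank candidate roots by P(prefix) * P(suffix) * P(template)."""
    scores = {}
    for i in range(len(word)):
        prefix = word[:i] or "#"              # "#" marks an empty prefix
        if prefix not in prefix_p:
            continue
        for j in range(i + 1, len(word) + 1):
            suffix = word[j:] or "#"          # "#" marks an empty suffix
            if suffix not in suffix_p:
                continue
            stem = word[i:j]
            for template, p_template in template_p.items():
                root = match_template(stem, template)
                if root is None or root not in valid_roots:
                    continue
                score = prefix_p[prefix] * suffix_p[suffix] * p_template
                # Assumption: keep the highest-scoring analysis per root.
                scores[root] = max(scores.get(root, 0.0), score)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```
]]>
 For the word "AymAn", a call with the prefixes "#", "A", "Ay" and the suffixes "#", "n", "An" in the probability tables enumerates the same nine prefix-suffix splits discussed above, and only the stems that match a known template and produce a listed root are returned.</Paragraph>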
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Missed or Erroneous Roots
</SectionTitle>
      <Paragraph position="0"> As seen above, the system deemed the stem "ym" not feasible, while in actuality the stem maps to the root "ymm". Other cases where the system failed were when the root had weak letters. Weak letters are "A", "y", and "w". The weak letters are frequently substituted for each other in stem form or dropped altogether.</Paragraph>
      <Paragraph position="1"> For example, the word "qAl" has the root "qwl" or "qyl", which would make the word mean 'he said' or 'he napped' respectively. Also, the word "f" has the root "wfy", where the letters "w" and "y" are missing. To compensate for these problems, two-letter stems were corrected by introducing new stems that are generated by doubling the last letter (to produce "ymm" from "ym") and by adding weak letters before or after the stem. As for stems with a weak middle letter, new stems are introduced by substituting the middle letter with the other weak letters. For example, for "qAl", the system would introduce the stems "qwl" and "qyl". This process over-generates potential roots. For example, of the three potential roots "qAl", "qwl", and "qyl", "qAl" is not a valid root and is thus removed (by comparing to the list of valid roots). To account for the changes, the following probabilities were calculated: (a) the probability that a weak letter w1 would be transformed to another weak letter w2, (b) the probability that a two-letter word would have a root with the second letter doubled (such as "ymm"), and (c) the probability that a two-letter word was derived from a root by dropping an initial or trailing weak letter. The new probability of the root becomes the product P(S1) * P(S2) * P(T) multiplied by the probability of the letter substitution or letter addition that was applied.</Paragraph>
      <Paragraph position="3"> As for smoothing the prefix and suffix probabilities, Witten-Bell discounting was used [17]. The smoothing is necessary because many prefixes and suffixes were erroneously produced.</Paragraph>
      <Paragraph position="4"> This is a result of word-root pair errors. Using this smoothing strategy, if a prefix or a suffix is observed only once, then it is removed from the respective list. As for the list of templates, it was reviewed by an Arabic speaker (the author of the paper) to ensure the correctness of the templates.</Paragraph>
      <Paragraph position="5"> The Arabic examiner was aided by example words the system provided for each template. If a template was deemed not correct, it was removed from the list.</Paragraph>
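      <Paragraph position="6"> The stem corrections described in this subsection (doubling the last letter of a two-letter stem, adding a weak letter before or after it, and swapping a weak middle letter) can be sketched in Python as follows. This is an illustrative sketch, not the system's code: the names expand_stem and plausible_roots are hypothetical, the weak-middle-letter rule is assumed to apply to three-letter stems only, and the associated probabilities (a) to (c) are left out.
<![CDATA[
```python
WEAK_LETTERS = "Ayw"  # weak letters in the transliteration used in the text


def expand_stem(stem):
    """Return the stem plus the extra candidate stems introduced for it."""
    candidates = [stem]
    if len(stem) == 2:
        # Double the last letter, e.g. "ym" gives "ymm".
        candidates.append(stem + stem[-1])
        # Add a weak letter before or after the stem.
        for weak in WEAK_LETTERS:
            candidates.append(weak + stem)
            candidates.append(stem + weak)
    elif len(stem) == 3 and stem[1] in WEAK_LETTERS:
        # Substitute the weak middle letter with the other weak letters.
        for weak in WEAK_LETTERS:
            if weak != stem[1]:
                candidates.append(stem[0] + weak + stem[2])
    # Remove duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(candidates))


def plausible_roots(stem, valid_roots):
    """Keep only the candidates that appear in the list of valid roots."""
    return [candidate for candidate in expand_stem(stem) if candidate in valid_roots]
```
]]>
 For the stem "qAl", the candidates are "qAl", "qyl", and "qwl", and filtering against a root list that contains "qwl" and "qyl" but not "qAl" removes the over-generated candidate, as described above.</Paragraph>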
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Particles
</SectionTitle>
      <Paragraph position="0"> To account for particles, a list of Arabic particles was constructed with the aid of An-Nahw Ash-Shamil (an Arabic grammar book) [6]. If the system matches a potential stem to one of the words on the particle list, it indicates that the word is a particle. Note that particles are allowed to have suffixes and prefixes. A complete list of the particles used by Sebawai is available upon request.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Letter Normalizations
</SectionTitle>
      <Paragraph position="0"> The system employs a letter normalization strategy in order to account for spelling variations and to ease the deduction of roots from words. The first normalization deals with the letters "y" and "Y" (alef maqsoura). Both are normalized to "y". The reason behind this normalization is that there is no single convention for spelling "y" or "Y" when either appears at the end of a word (note that "Y" only appears at the end of a word). In the Othmani script of the Holy Qur'an, for example, any "y" is written as "Y" when it appears at the end of a word [18]. The second normalization is that of "ء" (hamza), "آ" (alef maad), "أ" (alef with hamza on top), "ؤ" (hamza on w), "إ" (alef with hamza on the bottom), and "ئ" (hamza on ya). The reason for this normalization is that all forms of hamza are represented in dictionaries as one in root form, namely "ء" or "أ", depending on the dictionary, and people often misspell different forms of alef. All are normalized to the symbol "A".</Paragraph>
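      <Paragraph position="1"> The two normalizations can be sketched in Python as a single character mapping. This is a minimal sketch under one stated assumption: it operates directly on Unicode Arabic text and maps the hamza and alef variants to bare alef, whereas the text above describes the result in its transliteration as the symbol "A". The names NORMALIZATION and normalize are hypothetical.
<![CDATA[
```python
# Map each variant letter to its normalized form (Unicode code points).
NORMALIZATION = str.maketrans({
    "\u0649": "\u064A",  # alef maqsoura -> ya
    "\u0621": "\u0627",  # hamza -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0623": "\u0627",  # alef with hamza on top -> alef
    "\u0624": "\u0627",  # hamza on waw -> alef
    "\u0625": "\u0627",  # alef with hamza on the bottom -> alef
    "\u0626": "\u0627",  # hamza on ya -> alef
})


def normalize(word):
    """Apply the letter normalizations to a word written in Arabic script."""
    return word.translate(NORMALIZATION)
```
]]>
 Applying normalize before both training and root detection would keep spelling variants of the same word from being counted as different words.</Paragraph>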
    </Section>
  </Section>
</Paper>