<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1009">
  <Title>A Computational Morphology System for Arabic</Title>
  <Section position="2" start_page="0" end_page="69" type="ackno">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a computer system for Arabic morphology that employs a new, faster algorithm to find roots and patterns for verb forms and for nouns and adjectives derived from verbs. The program has been tested on a corpus of 242 abstracts from the Saudi Arabian National Conferences and we are in the process of extending the list of roots to handle a newspaper corpus as well.</Paragraph>
    <Paragraph position="1"> To represent the Arabic character set, we used the Nafitha software developed by 01 system, Manama, Bahrain (Nafitha 1988).</Paragraph>
    <Paragraph position="2"> The morphology system was written with the goal of supporting natural language processing programs such as parsers and information retrieval systems. It is coordinated with a large Arabic lexicon (AlSamara, 1996). It can, however, be used to display whole paradigms for Arabic verbs.</Paragraph>
    <Paragraph position="3"> It can also display a single form, if the user chooses to specify not just the root but the mood, gender, number, and person. It can also analyze any verb form given to it. In addition to 1,116 roots for regular verbs, the system stores forms for the thirty-nine most common irregular verbs. The Arabic word for morphology is &quot;t(a)Sryf&quot;, based on the root &quot;Srf&quot;, which carries the basic idea of changing direction, averting, and flowing freely. &quot;t(a)Sryf&quot; is the total range of morphological patterns used with a given root (Owens, 1988). Here the &quot;S&quot; in &quot;t(a)Sryf&quot; stands for the Arabic letter Sad (ص), which has no corresponding letter in English.</Paragraph>
    <Paragraph position="4"> We became involved in problems of morphology because we need to find stems and roots for purposes of information retrieval (Al-Kharashi and Evens, 1994; Abu-Salem, 1992; Hmeidi, 1995) and parsing (Abu-Arafah, 1995). The morphology system is coordinated with a large lexicon for Arabic (Hammouri, 1994; AlSamara, 1996).</Paragraph>
    <Paragraph position="5"> The organization of this paper is straightforward. The next section contains an overview of other approaches to computational morphology. Then we describe our approach to Arabic morphology and its extension to four- and five-letter roots as well as the much more common three-letter roots. Finally, we show examples of the output that the program produces when it is used interactively and conclude with plans for future research.</Paragraph>
    <Section position="1" start_page="66" end_page="66" type="sub_section">
      <SectionTitle>
2. Review of Some Other Morphology Systems
</SectionTitle>
      <Paragraph position="0"> Systematic attempts at computational morphology in the West were successful enough by 1992 to lead to the almost simultaneous publication of two major books, Sproat (1992) and Ritchie et al. (1992). At about the same time the PC-Kimmo program became widely available (Antworth, 1992). It had been obvious from the very beginning of Arabic language processing that morphology systems were an absolute necessity, because of the extremely complex morphology that Arabic shares with other Semitic languages.</Paragraph>
      <Paragraph position="1"> Hegazi and ElSharkawi (1986) designed a system to detect the root of any Arabic word along with morphological patterns and word categories. Their system has also been used for detection and correction of mistakes in spelling and vowelization.</Paragraph>
    </Section>
    <Section position="2" start_page="66" end_page="69" type="sub_section">
      <Paragraph position="0"> Saliba and Al-Dannan (1989) developed a Comprehensive Arabic Morphological Analysis and Generation System at the IBM Scientific Center in Kuwait. Their analyzer examines the input word for different word types and attempts to find all possible analyses. In the analysis process the longest valid prefix and suffix are stripped from the word and the remaining part of the word, which is called the stem, is used to identify a valid Arabic word. If the stem is accepted as a content word (noun or verb), then further analysis processes will be carried out.</Paragraph>
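The affix-stripping idea described above can be illustrated with a minimal Python sketch. It is not the Saliba and Al-Dannan implementation: the transliterated prefix, suffix, and stem lists and the function names are hypothetical, and "valid" is simplified here to mean that the leftover stem appears in a small stem list.

```python
from typing import Optional, Tuple

# Hypothetical transliterated data, for illustration only.
PREFIXES = ["wal", "al", "w", "y"]
SUFFIXES = ["wn", "at", "h"]
STEMS = {"ktab", "ktb", "drs"}


def longest_affix(word: str, affixes: list, at_start: bool) -> str:
    """Return the longest listed affix found at the start or end of word."""
    if at_start:
        matches = [a for a in affixes if word.startswith(a)]
    else:
        matches = [a for a in affixes if word.endswith(a)]
    return max(matches, key=len, default="")


def analyze(word: str) -> Optional[Tuple[str, str, str]]:
    """Strip the longest prefix, then the longest suffix, and check the stem."""
    prefix = longest_affix(word, PREFIXES, at_start=True)
    rest = word[len(prefix):]
    suffix = longest_affix(rest, SUFFIXES, at_start=False)
    stem = rest[:len(rest) - len(suffix)]
    if stem in STEMS:
        return prefix, stem, suffix
    return None


print(analyze("yktbwn"))
print(analyze("walktab"))
```

A fuller analyzer would backtrack to shorter affixes when the longest ones leave no valid stem; this sketch tries only the longest match on each side.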
      <Paragraph position="1"> El-Sadany and Hashish (1989) developed an Arabic morphological system also designed to carry out both analysis and generation, capable of dealing with vowelized, semivowelized, and nonvowelized Arabic words. This system was developed at the IBM Cairo Scientific Center. The system has the ability to vowelize nonvowelized words. The system was implemented in Prolog on the IBM PS/2 Model 60.</Paragraph>
      <Paragraph position="2"> Al-Fedaghi and Al-Anzi (1989) present an algorithm to generate the root and the pattern of a given Arabic word. The main concept in the algorithm is to locate the position of the three letters of a possible triliteral root in the pattern and check to see whether the candidate trigram appears in a list of known roots.</Paragraph>
      <Paragraph position="3"> When we began to work on the morphology problem ourselves, our first reaction was to start with PC-Kimmo, which we had used in some experiments with much simpler problems in English morphology. But when we communicated with Evan Antworth of the Summer Institute of Linguistics, he discouraged us: &quot;The basic two level mechanism as it is implemented in PC-KIMMO can't easily handle (if at all) the distinctive semitic patterns of consonantal root and intercalated vowels&quot;.</Paragraph>
      <Paragraph position="4"> When we received this message we abandoned our plans to use PC-Kimmo and resolved first to extend the Al-Fedaghi and Al-Anzi algorithm to handle quadriliteral roots and then to look for ways to improve on it.</Paragraph>
      <Paragraph position="5"> 3. Algorithm to Find Quadriliteral Roots. The first author designed and implemented an algorithm to find quadriliteral roots and their patterns. This algorithm follows the same strategy as the algorithm of Al-Fedaghi and Al-Anzi (1989).</Paragraph>
      <Paragraph position="6"> Quadriliteral roots are usually formed as extensions of triliteral roots by reduplicating the final consonant. Thus, the standard triliteral pattern &quot;f9l&quot; becomes the quadriliteral pattern &quot;f9ll.&quot; Here 9 stands for the letter &quot;ayn&quot;, which has no corresponding letter in English. The other forms of quadriliteral verbs are then obtained by adding affixes to the root. The first step of the algorithm for quadriliteral roots is to search the input form for a correct pattern. We take a candidate pattern and look for the four letters in the input word (corresponding to f, 9, l, and l). If the letters are found we label their positions pos1, pos2, pos3, and pos4. Otherwise, we choose the next candidate pattern and try again. Once we have a match in all four positions we go to the second step.</Paragraph>
      <Paragraph position="7"> The second step is to extract the root from the input word in the positions pos1, pos2, pos3, and pos4.</Paragraph>
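The two steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: it assumes Latin transliteration, the pattern list is hypothetical, and it assumes that affix letters in a pattern never coincide with the slot letters f, 9, and l.

```python
from typing import Optional, Tuple

# Hypothetical transliterated patterns; "f", "9", "l", "l" mark the root slots.
PATTERNS = ["f9ll", "tf9ll", "yf9ll", "mf9ll"]
SLOTS = "f9ll"


def slot_positions(pattern: str) -> Optional[list]:
    """Return the positions of f, 9, l, l in the pattern, in that order."""
    positions = []
    start = 0
    for slot in SLOTS:
        idx = pattern.find(slot, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    return positions


def find_root(word: str, patterns=PATTERNS) -> Optional[Tuple[str, str]]:
    """Return (root, pattern) for the first pattern that fits the word."""
    for pat in patterns:
        if len(pat) != len(word):
            continue  # only patterns of the same length are candidates
        pos = slot_positions(pat)
        if pos is None:
            continue
        # Step 1: overwrite the word's letters at the slot positions.
        letters = list(word)
        for p, slot in zip(pos, SLOTS):
            letters[p] = slot
        if "".join(letters) == pat:
            # Step 2: the root is what the word holds at those positions.
            return "".join(word[p] for p in pos), pat
    return None


print(find_root("tdHrj"))
```

With the hypothetical pattern list above, the transliterated form "tdHrj" matches "tf9ll" and yields the root "dHrj".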
      <Paragraph position="8"> 4. New Approach to Finding the Root and the Pattern. The algorithm for quadriliteral roots shown in Figure 1 is an extension of the triliteral algorithm of Al-Fedaghi and Al-Anzi (1989). Once we had implemented it successfully, we were concerned that it was somewhat slow, so we searched for a new approach that would give us the same result. This new approach was then implemented for both triliteral and quadriliteral roots.</Paragraph>
      <Paragraph position="9"> We describe how our approach works for triliteral roots. The first step is to remove the longest possible prefix. Then we look at the remainder. The three letters of the root must lie somewhere in the first four or five characters of the remainder. What is more, the first letter of the remainder is the first letter of the root since we have removed the longest possible prefix.</Paragraph>
      <Paragraph position="10"> We check all possible trigrams within the first five letters of the remainder. That is, we check the following six possible trigrams: (1) first, second, and third letters; (2) first, second, and fourth; (3) first, second, and fifth; (4) first, third, and fourth; (5) first, third, and fifth; (6) first, fourth, and fifth. In order to test the algorithm, we prepared two files: a file of roots and a file of prefixes. The program outputs the root and the pattern for each word in each of the 242 abstracts. Our colleagues in the Arabic Language Processing Laboratory checked all the results for correctness.</Paragraph>
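The trigram search for triliteral roots can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' program: the transliterated root and prefix lists stand in for the two files mentioned above, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Optional

# Hypothetical transliterated data; the paper's root and prefix files
# are not reproduced here.
ROOTS = {"ktb", "drs", "9lm"}
PREFIXES = ["y", "st", "yst", "m"]


def strip_longest_prefix(word: str, prefixes=PREFIXES) -> str:
    """Remove the longest listed prefix that the word starts with."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix):
            return word[len(prefix):]
    return word


def candidate_trigrams(remainder: str) -> list:
    """First letter plus any two later letters from the first five."""
    window = remainder[:5]
    if not window:
        return []
    return [window[0] + a + b for a, b in combinations(window[1:], 2)]


def find_root(word: str, roots=ROOTS) -> Optional[str]:
    """Return the first candidate trigram that is a known root."""
    remainder = strip_longest_prefix(word)
    for trigram in candidate_trigrams(remainder):
        if trigram in roots:
            return trigram
    return None


print(candidate_trigrams("abcde"))
print(find_root("mktwb"))
```

Because the first letter is fixed, a five-letter window produces exactly six candidates, in the same order as the list in the text. A greedy prefix strip can remove a genuine root letter, so a real system needs the prefix list and root list to be checked against each other.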
      <Paragraph position="11"> In the abstracts there are 19,167 running words, 16,775 with triliteral roots, and 1,124 with quadriliteral roots, none with quintiliteral roots. The program handles all these correctly. The other 1,268 words are nouns not derived from verbal roots (solid nouns) or borrowings from foreign languages.</Paragraph>
      <Paragraph position="12"> The algorithm requires less space and much less time than the Al-Fedaghi and Al-Anzi algorithm. The average time to search for the roots for all words in an abstract is 2.2 seconds and the average time to search for roots with the Al-Fedaghi and Al-Anzi algorithm is 17.2 seconds. The average length of an abstract is 35 words.</Paragraph>
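The figures reported above can be cross-checked with a few lines of Python. The variable names are chosen here for illustration; the numbers are the ones given in the text.

```python
# Word counts reported for the 242 abstracts.
total_words = 19_167
triliteral = 16_775
quadriliteral = 1_124
non_derived = 1_268  # solid nouns and foreign borrowings

# The three categories account for every running word.
assert triliteral + quadriliteral + non_derived == total_words

# Average per-abstract search times in seconds.
new_seconds = 2.2
old_seconds = 17.2

speedup = old_seconds / new_seconds
share_with_roots = (triliteral + quadriliteral) / total_words

print(f"speedup: {speedup:.1f}x")
print(f"words with verbal roots: {share_with_roots:.1%}")
```

On these numbers the new algorithm is roughly 7.8 times faster per abstract, and about 93.4% of the running words have a triliteral or quadriliteral root.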
      <Paragraph position="13"> 5. The Morphology System. The main system menu contains the following options. First, get the various paradigms of the word. This is most often needed by human users and perhaps tutoring programs. Second, get a specific form after passing in a word and mood, person, number, and gender. This is most often needed by text generation systems. Third, analyze the input word to get back the part of speech, person, number, and gender. This is most often needed by a parser. Fourth, get the root of the input word. This is most often needed by information retrieval systems. The main menu of the system is shown in Figure 2.</Paragraph>
      <Paragraph position="14"> From the main menu the user can select one of the four options. If the user selects the first option, he/she will get all the information about the input word as seen in Figure 3. When the user selects the second option, the menus in Figures 4, 5, 6, and 7 appear in sequence to select the appropriate codes. Examples of the output in these cases are shown below.</Paragraph>
      <Paragraph position="15">
begin
  get word
  for all patterns that have the same length as the input word do
  begin
    let pat = pattern
    locate the positions of the letters f, 9, l, and l in pat
    let pos1, pos2, pos3, and pos4 be the positions respectively
    replace the letters in the given word at the positions pos1, pos2, pos3, and pos4 with the letters f, 9, l, and l respectively
    let new-word be the word formed in the previous step
    if (new-word == pat) then exit the loop
</Paragraph>
      <Paragraph position="17"> Given the word C/.~ after passing the mood = imperfect, number = plural, gender = masculine, and person = 3rd person the form is ~ ,..ak.</Paragraph>
      <Paragraph position="18"> When the user selects the third option he/she will get the following output: the input word ~..xk.</Paragraph>
      <Paragraph position="19"> (~.~ verb 3rd sing masc) When the user selects option four he/she will get the following output: the input word ~.-~.</Paragraph>
      <Paragraph position="20"> the root of ~.ak, is ~.~  the first step of most natural language processing applications. We have developed a new algorithm that runs an order of magnitude faster than other algorithms in the literature. We plan to extend our system to generate adjectives and different types of derived nouns.</Paragraph>
      <Paragraph position="21"> The area of vowelization deserves further research. Vowelization is very important for resolving ambiguity in word meaning and for determining correct pronunciation.</Paragraph>
      <Paragraph position="22"> Vowelizing Arabic text is the process of placing the short vowels above and below Arabic consonants. Our concentration in this project has been on the analysis of nonvowelized text. The next step is to investigate this area further, in order to build a morphological system that can analyze vowelized text.</Paragraph>
    </Section>
  </Section>
</Paper>