<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1026"> <Title>A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots</Title> <Section position="9" start_page="0" end_page="0" type="ackno"> <SectionTitle> Acknowledgements </SectionTitle> <Paragraph position="0"> Our thanks go to the Kuwait State's Public Authority for Applied Education and Training for the supporting research studentship, and to two anonymous referees for detailed, interesting and constructive comments.</Paragraph> <Paragraph position="1"> Appendix: Arabic in a Nutshell. The vast majority of Arabic words are derived from 3-letter (and a few 4-letter) roots via a complex morphology. Roots give rise to stems by the application of a set of fixed patterns. Addition of affixes to stems yields words.</Paragraph> <Paragraph position="2"> Table 9 shows examples of stem derivation from 3-letter roots. Stem patterns are formulated as variations on the characters f?L (pronounced as f'l; ? is the symbol for ayn, a strong glottal stop), where each of the successive consonants matches a character in the bare root (for ktb, k matches f, t matches ? and b matches L). Stems follow the pattern as directed. As the examples show, each pattern has a specific effect on meaning. Several hundred patterns exist, but on average only about 18 are applicable to each root (Beesley 1998).</Paragraph> <Paragraph position="3"> The language distinguishes between long and short vowels. Short vowels affect meaning, but are not normally written. However, patterns may involve short vowels, and the effects of some patterns are indistinguishable in written text. Readers must infer the intended meaning.</Paragraph> <Paragraph position="4"> Affixes may be added to the word, either under derivation, or to mark grammatical function. For instance, walktab breaks down as w (and) + al (the) + ktab (writers, or book, depending on the voweling).
Other affixes function as person, number, gender and tense markers, subject and direct object pronouns, articles, conjunctions and prepositions, though some of these may also occur as separate words (e.g. wal (and the)).</Paragraph> <Paragraph position="5"> Arabic morphology presents some tricky NLP problems. Stem patterns &quot;interdigitate&quot; with root consonants, which makes parsing difficult. Also, the long vowels a (alif), w (waw) and y (ya) can occur as root consonants, in which case they are considered to be weak letters, and the root a weak root. Under certain circumstances, weak letters may change shape (e.g. waw into ya) or disappear during derivation. Long vowels also occur as affixes, so identifying them as affix or root consonant is often problematic.</Paragraph> <Paragraph position="6"> The language makes heavy use of infixes as well as prefixes and suffixes, all of which may be consonants or long vowels. Apart from breaking up root letter sequences (which tend to be short), infixes are easily confused with root consonants, whether weak or not. The problem for affix detection can be stated as follows: weak root consonants are easily confused with long vowel affixes; consonant affixes are easily confused with non-weak letter root consonants.</Paragraph> <Paragraph position="7"> Erroneous stripping of affixes will yield the wrong root.</Paragraph> <Paragraph position="8"> Arabic plurals are difficult. The dual and some plurals are formed by suffixes, in which case they are called external plurals. The broken, or internal plural, however, changes the internal structure of the word according to a set of patterns. To illustrate the complexity, masculine plurals take a -wn or -yn suffix, as in mhnds (engineer), mhndswn. Feminine plurals add the -at suffix, or change word-final -h to -at, as in mdrsh (teacher), mdrsat.
Broken plurals affect root characters, as in mal (fund, from root mwl), amwal, or wSL (link, from root wSL), 'aySaL.</Paragraph> <Paragraph position="9"> The examples are rife with long vowels (weak letters?). They illustrate the degree of interference between broken plural patterns and other ways of segmenting words.</Paragraph> <Paragraph position="10"> Regional spelling conventions are common: e.g. three versions of word-initial alif occur. The most prominent orthographic problem is the behaviour of hamza ('), a sign written over a carrier letter and sounding a lenis glottal stop (not to be confused with ayn). Hamza is not always pronounced. Like any other consonant, it can take a vowel, long or short. In word-initial position it is always carried by alif, but may be written above or below, or omitted. Mid-word it is often carried by one of the long vowels, depending on rules whose complexity often gives rise to spelling errors. At the end of words, it may be carried or written independently.</Paragraph> <Paragraph position="11"> Hamza is used both as a root consonant and an affix, and is subject to the same problems as non-weak letter consonants, compounded by unpredictable orthography: identical words may have differently positioned hamzas and would be considered as different strings.</Paragraph> </Section> </Paper>