<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1007"> <Title>Arabic Morphology Using Only Finite-State Operations</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In Arabic, as in other natural languages, the two challenges of morphological analysis are the description of 1) the morphotactics and 2) the variation rules. Morphotactics is the study of how morphemes combine to make well-formed words. Variations are the discrepancies between the underlying or morphophonemic strings and their surface realizations, which are either phonological or orthographical strings depending on the purpose of the grammar.</Paragraph> <Paragraph position="1"> The key insight and claim of the finite-state approach to morphology (Karttunen, 1991; Karttunen et al., 1992; Karttunen, 1994) is that both morphotactics and variation grammars can be written as regular expressions, which are compiled and implemented on computers as finite-state automata. Such grammars are interesting theoretically because they are highly constrained; and in practical computational linguistics for natural languages, finite-state automata are fast, usually compact in size, bidirectional, combinable using all valid finite-state operations, and consultable using language-independent lookup code.</Paragraph> <Paragraph position="2"> Finite-state approaches to morphology, including the readily available implementation known as Two-Level Morphology (Koskenniemi, 1983; Antworth, 1990), have been shown to work in significant projects for French, English, Spanish, Portuguese, Italian, Finnish, Turkish and a wide variety of other natural languages.</Paragraph> <Paragraph position="3"> But despite the high attractiveness of finite-state computing, many investigators have concluded that finite-state techniques are not adequate for describing Semitic root-and-pattern morphology.
This paper will present the case that fully implemented finite-state morphology can be and has been used successfully for Arabic.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Regular Expressions </SectionTitle> <Paragraph position="0"> When writing a finite-state morphological grammar, linguists state morphotactic and variation rules in the metalanguage of regular expressions or in higher-level languages that are convenient shorthand notations for complex regular expressions.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.1 Regular Expressions, Regular </SectionTitle> <Paragraph position="0"/> </Section> <Section position="6" start_page="0" end_page="50" type="metho"> <SectionTitle> Transducers </SectionTitle> <Paragraph position="0"> A regular expression that contains an alphabet of one-level symbols defines a regular language and compiles into a finite-state machine (FSM) that accepts this regular language. A regular expression that contains an alphabet of paired symbols defines a regular relation (a relation between two regular languages) and compiles into a finite-state transducer (FST) that maps from every string of one language into strings of the other. If the necessary finite-state algorithms and compilers are available, components of the grammar, including various sublexicons and rules, can be compiled into separate transducers and then combined using any operations that are mathematically valid.</Paragraph> <Paragraph position="1"> The Xerox implementation of finite-state morphology includes a complete range of fundamental algorithms (concatenation, union, intersection, complementation, etc.)
plus higher-level shorthand languages such as lexc (Karttunen, 1993), twolc (Karttunen and Beesley, 1992) and Replace Rules (Karttunen, 1995; Karttunen and Kempe, 1995; Karttunen, 1996).</Paragraph> <Section position="1" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 2.2 Finite-State Operations </SectionTitle> <Paragraph position="0"> When defining morphotactics or variations via regular expressions, the linguist has access to all the operations that are mathematically valid on regular languages and relations. The following is a brief outline of regular expressions in the Xerox notation. For each symbol s, the regular expression s denotes a regular language consisting of the single string &quot;s&quot;. If A and B are regular languages, then the regular expressions in Figure 1 also denote regular languages. The cross-product of A and B, denoted A .x. B, relates each string in A, the upper language, to every string of B, the lower language, and vice versa. A .x. B thus denotes a regular relation. Where u and l are symbols, u:l is a notation equivalent to u .x. l.</Paragraph> <Paragraph position="1"> For formal reasons, relations are not quite as manipulable as simple languages; in particular, relations are closed under concatenation, union, and iteration, but not under intersection, subtraction or complementation.</Paragraph> <Paragraph position="2"> Relations are closed under composition, a somewhat more difficult operation to conceptualize. Let A, B and C denote regular languages; let X denote a regular relation between an upper-side language A and a lower-side language B; and let Y denote a regular relation between the upper-side language B and a lower-side language C. Then the composition of Y under X, denoted X .o.
Y, denotes a regular relation Z that maps directly between languages A and C; the intermediate language B disappears in the process of composition.</Paragraph> <Paragraph position="3"> In defining natural-language morphotactics, union and concatenation are the basic operations required. Variation rules and long-distance-dependency filters are applied using composition. And we shall illustrate below how Arabic root-and-pattern interdigitation can be performed via intersection and composition.</Paragraph> </Section> </Section> <Section position="7" start_page="50" end_page="54" type="metho"> <SectionTitle> 3 Regular-Expression Grammars </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.1 Concatenative Morphotactics </SectionTitle> <Paragraph position="0"> Individual morphemes of natural language typically consist of one or more symbols, simply concatenated together. Thus the English morphemes s, ed and ing represent the concatenations [s], [e d] and [i n g] respectively. Where 0 represents ε (the zero-length string), the set of regular verb suffixes of English can be represented as the union [[s] | [e d] | [i n g] | 0]. The set of verb stems taking these endings includes wreck, walk, and talk, which can also be formalized using concatenation and union: [[w r e c k] | [w a l k] | [t a l k]].
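The suffix and stem unions just given can be imitated in Python as a rough sketch. It assumes that a finite regular language may be modeled as a plain set of strings; the names STEMS, SUFFIXES, concatenate and VERBS are hypothetical and belong to no Xerox tool.

```python
# Illustrative sketch only: finite regular languages modeled as Python sets.
# STEMS, SUFFIXES, concatenate and VERBS are hypothetical names.

STEMS = {"wreck", "walk", "talk"}     # [[w r e c k] | [w a l k] | [t a l k]]
SUFFIXES = {"s", "ed", "ing", ""}     # [[s] | [e d] | [i n g] | 0]; "" is the zero-length string


def concatenate(left, right):
    """Return every string made of one member of left followed by one member of right."""
    return {a + b for a in left for b in right}


# Concatenating the union of suffixes onto the union of stems.
VERBS = concatenate(STEMS, SUFFIXES)
print(sorted(VERBS))
```

Set union plays the role of the | operator and the nested comprehension plays the role of concatenation; infinite languages would need a real automaton library, which this sketch does not attempt.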
The union of endings can then be concatenated on the end of the union of verb stems to form a larger expression that denotes a language that looks like a subset of English verbs: [[w r e c k] | [w a l k] | [t a l k]] [[s] | [e d] | [i n g] | 0].</Paragraph> <Paragraph position="1"> If the linguist defines the symbols +Verb, +3PS (for &quot;third person singular&quot;), +Past, +PrPart (for &quot;present participle&quot;) and +Bare, the following expression denotes the relation that maps a lower-side (surface) string like &quot;talks&quot; to the upper-side string &quot;talk+Verb+3PS&quot;, and vice versa. The preceding plus signs of these &quot;tag&quot; symbols are included simply to improve the human readability of the resulting strings; because the plus sign is normally a special Kleene Plus symbol in regular expressions, it is literalized in the examples below with a preceding percent sign.</Paragraph> <Paragraph position="2"> [[w:w r:r e:e c:c k:k] | [w:w a:a l:l k:k]</Paragraph> <Paragraph position="4"> By convention in Xerox regular expressions denoting relations, the relation s:s can be written simply as s, as in the following:</Paragraph> <Paragraph position="6"> The English-verb fragment shown here was carefully chosen to be simple. However, there are three classes of phenomena for which union and concatenation, by themselves, are generally inadequate or at least very inconvenient for describing all and only the strings that appear in a natural language: discontiguous dependencies between morphemes in a word; non-concatenative morphotactic processes such as reduplication and Semitic interdigitation; and variations, typically assimilations, deletions and epentheses, that map between the abstract morphophonemic strings and their correct surface realizations.</Paragraph> <Paragraph position="8"> Figure 1: Some Finite-State Notations. 0: the zero-length string (often called ε); [A]: bracketing, denotes the same language as A; A B: the concatenation of B after A; A | B: the union of A and B; A &amp; B: the intersection of A and B; (A): optionality, equivalent to [A | 0]; A*: Kleene star iteration, zero or more concatenations of A; A+: equivalent to [A A*], i.e. one or more concatenations of A; A/B: the regular language A, ignoring any instances of B; ?: any symbol, i.e. the union of all single-symbol strings; A - B: language A, minus all strings in language B; \B: equivalent to [? - B], the union of all single-symbol strings minus the strings in B; ~B: equivalent to [?* - B], the complement of B.</Paragraph> <Paragraph position="9"> We continue with illustrations of how such phenomena can be handled in a finite-state grammar.</Paragraph> </Section> <Section position="2" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 3.2 Discontiguous Dependencies </SectionTitle> <Paragraph position="0"> To illustrate discontiguous dependencies, let us ignore for a moment the internal structure of Arabic stems and postulate a set of noun stems including kaatib (&quot;scribe&quot;), kitaab (&quot;book&quot;), and daaris (&quot;student&quot;), formalized as [[k a a t i b] | [k i t a a b] | [d a a r i s]]. The set of possible case endings includes the definite set u (nominative), a (accusative) and i (genitive) as well as the indefinite set un (nominative), an (accusative) and in (genitive). (Orthographically, the indefinite case endings consist of single symbols that are distinct from the single symbols used for definite endings.) The most straightforward way to describe the morphotactics of a fragment of Arabic nouns is to concatenate the possible case endings onto the noun stems.</Paragraph> <Paragraph position="2"> Informative multicharacter symbols, including +Noun and +Indef (for &quot;indefinite&quot;), and +Nom, +Acc and +Gen are defined for the upper-side language.</Paragraph> <Paragraph position="3"> [[k a a t i b] | [k i t a a b] | [d a a</Paragraph> <Paragraph position="5"> The resulting relation includes pairs of strings like Upper: kaatib+Noun+Indef+Acc, Lower: kaatiban. Arabic nouns can also have a prefixed definite article, which we will represent as l, and prefixed prepositions like bi. Both are optional, and if bi and l cooccur, then bi must come first. The most straightforward way to allow these prefixes is to concatenate them on the front of the regular expression as in Figure 2. Prep+ and Art+ are interpreted as multicharacter symbols, and the parentheses indicate optionality, as shown in Figure 1.</Paragraph> <Paragraph position="6"> However, Arabic words with a prefixed definite article l are in fact precluded from taking indefinite case suffixes. And words with a prefixed bi are compatible only with genitive case suffixes. The expression, as written in Figure 2, overgenerates, producing ill-formed string pairs that violate these restrictions. It is possible to rewrite the regular expression in various ways to eliminate the overgeneration, but this is tedious and error-prone, requiring the making and subsequent parallel maintenance of multiple copies of the noun stems. In practice, it is much more convenient to let the core lexicon overgenerate and subsequently filter out the bad strings, either at compile time or at runtime. The most straightforward method is to remove the ill-formed strings via composition of finite-state filters. Starting with the overgenerating grammar of Figure 2, one set of illegal strings to be eliminated contains both the Art+ and the +Indef symbols on the upper side.
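The overgenerate-then-filter strategy can be sketched in Python under stated assumptions: upper-side strings are modeled as tuples of symbols, the +Def symbol is assumed by analogy with +Indef, and the names overgenerate, is_illegal and apply_filter are hypothetical. The sketch imitates on a finite set what a complement filter on the upper side achieves; it does not build or compose transducers.

```python
from itertools import product

# Hypothetical upper-side symbols; +Def is an assumption made for this sketch.
PREFIXES = [(), ("Prep+",), ("Art+",), ("Prep+", "Art+")]   # both optional, Prep+ first
STEMS = ["kaatib", "kitaab", "daaris"]
DEFINITENESS = ["+Def", "+Indef"]
CASES = ["+Nom", "+Acc", "+Gen"]


def overgenerate():
    """Build every upper-side symbol sequence the unconstrained lexicon allows."""
    return {
        prefix + (stem, "+Noun", definiteness, case)
        for prefix, stem, definiteness, case
        in product(PREFIXES, STEMS, DEFINITENESS, CASES)
    }


def is_illegal(symbols):
    """True for the two ill-formed classes named in the text."""
    article_with_indefinite = "Art+" in symbols and "+Indef" in symbols
    preposition_without_genitive = "Prep+" in symbols and "+Gen" not in symbols
    return article_with_indefinite or preposition_without_genitive


def apply_filter(language):
    """Keep only the complement of the illegal set."""
    return {symbols for symbols in language if not is_illegal(symbols)}


ALL_STRINGS = overgenerate()
GOOD_STRINGS = apply_filter(ALL_STRINGS)
print(len(ALL_STRINGS), len(GOOD_STRINGS))
```

With these assumed symbol inventories the unfiltered set holds 72 sequences and 36 survive. The noun stems are listed only once, which is the maintenance advantage the filtering approach is meant to preserve.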
We can characterize these illegal strings in a regular expression, and the strings that combine Prep+ with a non-genitive case symbol in a second one. The union of these two expressions characterizes the ill-formed upper-side strings to be eliminated, and the complement (notated ~) of that union denotes the good strings.</Paragraph> <Paragraph position="8"> When this &quot;filter&quot; expression is composed on top of the overgenerating lexicon transducer, only the legal strings are matched, and the illegal strings are in fact eliminated from the result, which is again a finite-state transducer. There are several variations of this method that produce the same effect (Beesley, 1998d), with different penalties in the size of the resulting transducer or in the performance; but in the end the constraint of discontiguous dependencies is easily accomplished using finite-state techniques.</Paragraph> </Section> <Section position="3" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 3.3 Non-Concatenative Morphotactics </SectionTitle> <Paragraph position="0"> While the morphotactic structure of many natural languages can be satisfactorily described using just concatenation, perhaps with subsequent filtering to constrain discontiguous dependencies, there are other languages with morphotactic phenomena that are notoriously nonconcatenative, in particular reduplication, infixation and Semitic stem interdigitation (also known as intercalation). We will concentrate on Arabic here, arguing that roots, patterns and vocalizations can be formalized as regular expressions denoting regular languages, and that stems are formed by the intersection of these regular languages.</Paragraph> <Paragraph position="1"> For illustration, let us assume, following the influential McCarthy (1981) analysis fairly closely, that Arabic stems consist of a root like ktb, a consonant-vowel template such as CVCVC, and a vocalization like ui.
Where McCarthy proposed an extension of autosegmental theory, placing each of these morphemes on a separate tier, and proposing &quot;association rules&quot; to combine and linearize them into the stem kutib, we propose to formalize the same data in purely finite-state terms.</Paragraph> <Paragraph position="2"> Let each root like ktb be formalized as [k t b]/?, i.e. as the language consisting of all strings containing k, t and b, in that order, ignoring the presence of any other symbols. (The notation [k t b]/? is equivalent to [?* k ?* t ?* b ?*].) Let C denote the union of all radical consonants, and let V denote [a | i | u], the union of all vowels. CV templates are defined as concatenations of Cs and Vs. Using the Xerox xfst interface, these definitions can be computed as:
define ktb [k t b]/? ;
define drs [d r s]/? ;
define C [ k | t | b | d | r | s ] ;
define V [ a | i | u ] ;
define FormI [ C V C V C ] ;
define FormII [ C V C X V C ] ;
define FormIII [ C V V C V C ] ;
Vocalizations are also defined as regular expressions denoting regular languages, e.g. Perfect Active as [a*]/\V, the set of all strings containing zero or more occurrences of a, ignoring all symbols that are not vowels. Other vocalizations are defined similarly. Given the definitions above, xfst will evaluate an expression indicating the intersection of a root, a pattern and a vocalization, and return a language consisting of a single string, an interdigitated but still morphophonemic stem (Beesley, 1998a).</Paragraph> <Paragraph position="4"> The X symbol in the FormII pattern represents gemination (or lengthening) of the previous consonant, and its realization is controlled by variation rules.
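The three-way intersection of a root, a pattern and a vocalization can be imitated in Python as a hedged sketch. It is not xfst: the ignore operator is translated by hand into Python regular expressions, the finite pattern language is enumerated and then tested against the other two languages, and the names root, pattern and intersect are hypothetical. PERFECT_PASSIVE is an assumed definition, written as [u* i]/\V so that ktb with CVCVC gives the stem kutib cited from McCarthy.

```python
import re
from itertools import product

CONSONANTS = "ktbdrs"   # C: the radical consonants of the two sample roots
VOWELS = "aiu"          # V
NON_VOWEL = "[^aiu]"    # stands in for \V in the ignore operator


def root(radicals):
    """[k t b]/? : the radicals in order, with any other symbols interspersed."""
    return re.compile(".*" + ".*".join(radicals) + ".*")


def pattern(template):
    """Expand a template such as 'CVCVC' into its finite set of strings."""
    choices = {"C": CONSONANTS, "V": VOWELS}
    # Symbols other than C and V, such as X, stand for themselves.
    slots = [choices.get(slot, slot) for slot in template]
    return {"".join(combo) for combo in product(*slots)}


def intersect(root_re, template, vocalization_re):
    """Strings of the pattern that also belong to the root and vocalization languages."""
    return {
        s for s in pattern(template)
        if root_re.fullmatch(s) and vocalization_re.fullmatch(s)
    }


KTB = root("ktb")
DRS = root("drs")
# [a*]/\V : zero or more a, ignoring every symbol that is not a vowel.
PERFECT_ACTIVE = re.compile(f"{NON_VOWEL}*(a{NON_VOWEL}*)*")
# Assumed definition, [u* i]/\V, chosen to reproduce the stem kutib.
PERFECT_PASSIVE = re.compile(f"{NON_VOWEL}*(u{NON_VOWEL}*)*i{NON_VOWEL}*")

print(intersect(KTB, "CVCVC", PERFECT_ACTIVE))
print(intersect(KTB, "CVCXVC", PERFECT_ACTIVE))
print(intersect(DRS, "CVVCVC", PERFECT_ACTIVE))
print(intersect(KTB, "CVCVC", PERFECT_PASSIVE))
```

Under these definitions each call returns a one-string set: katab, katXab, daaras and kutib respectively. Each result is still morphophonemic; the X in katXab is left untouched here because its realization is a matter for variation rules.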
Consonant spreading, as in Form IX and Form XII, and biliteral roots also use the morphophonemic X symbol (Beesley, 1998c).</Paragraph> <Paragraph position="5"> Form I vocalizations are in fact idiosyncratic for each root, and those for the Imperfect Active are more troublesome, but the same kind of formalism applies. (The Form I perfect active stem vowel for ktb happens to be /a/, so the general PerfectActive vocalization [a*]/\V works in this case; other roots will require [a i]/\V or [a u]/\V. For the Imperfect Passive, the vocalization is [u a*]/\V for all forms. For the Imperfect Active, the least attractive case for vowel abstraction, the Form I voweling is [a*]/\V, [a i]/\V or [a u]/\V, depending on the root; the Form II through IV voweling is [u a* i]/\V; the Form V and VI voweling is [a*]/\V; and the remaining forms VII to XV use [a* i]/\V. If such generalization of vocalization appears tenuous, the alternative is simply to keep the vowels in the patterns, resulting in a two-way intersection of roots and patterns (Harris, 1941; Kataja and Koskenniemi, 1988).) If patterns are allowed to contain non-radical consonants, as in the analyses of Harris (1941) and Hudson (1986), then the definitions must be complicated slightly to prevent radicals from intersecting with the non-radical consonants (Beesley, 1998b). For a different formalization of this and other models proposed by McCarthy, but using techniques that go beyond finite-state power, see Kiraz (1996).</Paragraph> </Section> <Section position="4" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 3.4 Defining Variation Rules </SectionTitle> <Paragraph position="0"> When underlying morphemes are concatenated and intersected together, the resulting strings are often still very abstract or morphophonemic; there may be many phonological or orthographical variations between these morphophonemic strings and their ultimate surface pronunciation or spelling.</Paragraph> <Paragraph position="1"> For example, English nouns usually pluralize by taking an s suffix, as in book/books, but words like fly pluralize as flies rather than *flys. The variation between underlying y and the surface ie can be defined in terms of two-level rules or Replace Rules, which partially mimic traditional rewrite rules in their superficial syntax (Chomsky and Halle, 1968). Johnson (1972) demonstrated that rewrite rules, as used by linguists, had only finite-state power and could be implemented as finite-state transducers; this important result, unfortunately overlooked at the time and later rediscovered by Kaplan and Kay (1981) (see also Kaplan and Kay (1994)), is a key mathematical foundation for finite-state morphology and phonology.</Paragraph> <Paragraph position="2"> The variation rules required for Arabic were relatively difficult to write, but they are not different in kind or power from the rules required for other languages. The most difficult challenges involve the so-called weak roots, those containing a w, y or hamza (glottal stop) as one of the radicals.</Paragraph> <Paragraph position="3"> Via concatenation and intersection, the lexicon produces morphophonemic strings like katab+a, the Form I perfect active of ktb, with a masculine singular +a suffix; similarly for daras+a, based on drs. These particular strings are very surfacy already, being realized in their fully-voweled form as kataba and darasa.
When trivial &quot;relaxation&quot; rules are composed on the bottom of the lexicon, allowing optional deletion of the short vowels, the system is also able to analyze the surface forms ktb and drs and all the other partially voweled variations. With weak roots, however, such as the finally weak bny, the dictionary generates parallel morphophonemic forms like banay+a, but the surface form is properly spelled with a y-like 'alif maqṣuura rather than with a normal y with two dots (the dotted spelling is not a possible spelling for underlying banay+a). This orthographical change reflects the fact that the word is pronounced /banaa/ rather than /banaja/. The perfect passive buniy+a, however, is still spelled as bny with the dotted y, reflecting a pronunciation of /bunija/, although in Egyptian orthographical practice the dots are usually dropped here as well, yielding the undotted spelling again. With the feminine ending, banay+at, the underlying y disappears completely, both phonologically and orthographically, yielding surface bnt.</Paragraph> <Paragraph position="4"> With a medially-weak root like qwl, the morphophonemic Form I perfect active qawul+a gets realized as qAl, reflecting the pronunciation /qaala/. When the suffix begins with a consonant, as in qawul+ta, the surface spelling is qlt, reflecting the pronunciation /qulta/. An initially weak example like ta+wlid+u, based on root wld, surfaces with the deletion of the initial radical w, while tu+wlad+u, with an initial tu+ prefix, surfaces with the w intact. Similarly for root w'd, but with hamza complications, in ya+w'id+u and yu+w'ad+u. The rule writer must also handle a number of assimilations, as in the Form VIII of root ðkr, underlying ðtakar+a, which is pronounced /ʔiddakara/ and written accordingly. Similarly, for roots with an initial pharyngealized ṣaad or ḍaad radical, such as ḍrb, the underlying Form VIII is ḍtarab+a, emerging with the infixed Form VIII t assimilating to its pharyngealized version ṭ. None of these phenomena is phonologically surprising; local assimilations and contextual instabilities in semiconsonants like /w/ and /y/ are garden-variety variations, elegantly handled with finite-state variation rules.</Paragraph> </Section> </Section> </Paper>