<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-1006">
  <Title>Multitiered Nonlinear Morphology Using Multitape Finite Automata: A Case Study on Syriac and Arabic</Title>
  <Section position="5" start_page="81" end_page="13121" type="metho">
    <SectionTitle>
LLC - LEX - RLC { ⇒, ⇐ }
SURF
</SectionTitle>
    <Paragraph position="0"> Further, capital-initial expressions are variables over predefined finite sets of symbols.</Paragraph>
    <Paragraph position="1"> The operator ⇒ is the optional operator. It states that LEX may surface as SURF in the given context, but may surface otherwise if sanctioned by another rule. The operator ⇐ adds obligatory constraints: when LEX appears in the given context, the surface description must satisfy SURF. A lexical string maps to a surface string if and only if they can be partitioned into pairs of lexical-surface subsequences, where (i) each pair is licensed by a ⇒ rule, and (ii) no sequence of zero or more adjacent pairs violates a ⇐ rule. The interpretation of the latter condition is based on Grimley-Evans, Kiraz, and Pulman (1996). (See Kiraz [in press] for the historical development of the formalism.) Several extensions are introduced into the formalism to handle multitiered representations. Expressions on the upper lexical side (LLC, LEX, and RLC) are tuples of strings of the form ⟨x1, x2, ..., xn-1⟩. The ith element in the tuple refers to symbols in the ith sublexicon of the lexical component. When a lexical expression makes use of</Paragraph>
    <Paragraph position="3"> Rules for the derivation of Syriac /ktab/. R1 and R2 sanction root consonants and vowels, respectively, while R3 handles vowel deletion.</Paragraph>
    <Paragraph position="4"> [Figure 4: tier diagrams aligning the surface forms /ktab/ and /?etktab/ with the pattern, root, vocalism, and affix tiers.]</Paragraph>
    <Paragraph position="5"> (b) Lexical-surface analysis of Syriac /ktab/ and /?etktab/. Vocalic spreading is ignored in this example (see Section 5.1).</Paragraph>
    <Paragraph position="6"> only the first sublexicon, the angle brackets can be ignored. Hence, the LEX expression ⟨x, ε, ..., ε⟩ and x are equivalent; in lexical contexts, ⟨x, *, ..., *⟩ and x are equivalent. Additionally, the symbol "*" now denotes Kleene star as applied to the alphabet of the respective tier.</Paragraph>
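The partition condition of the formalism (every lexical-surface subsequence licensed by an optional rule, none coerced by an obligatory rule) can be sketched at the level of finite pair sets. This is an illustration only, not the paper's implementation; contexts are omitted for brevity.

```python
def licensed(partition, optional_rules):
    # (i) every lexical-surface subsequence pair must be sanctioned
    # by some optional rule
    return all(pair in optional_rules for pair in partition)

def coerced(partition, obligatory_rules):
    # (ii) an obligatory rule is violated when its lexical side occurs
    # with a surface realization other than the required one
    return any(lex == r_lex and surf != r_surf
               for lex, surf in partition
               for r_lex, r_surf in obligatory_rules)

def accepts(partition, optional_rules, obligatory_rules):
    return licensed(partition, optional_rules) and not coerced(
        partition, obligatory_rules)
```

For instance, with optional rules sanctioning identity pairs plus vowel deletion, the partition [("k","k"), ("a",""), ("t","t"), ("a","a"), ("b","b")] for /katab/-to-/ktab/ is accepted.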
    <Paragraph position="7"> The formalism is illustrated in Figure 3. The rules derive Syriac /ktab/ (underlying */katab/) from the pattern morpheme {cvcvc} 'verbal Measure 1', the root morpheme {ktb} 'notion of writing', and the vocalism morpheme {aa} 'PERF ACT' (ignoring spreading for the moment). Rule R1 sanctions root consonants by mapping a [c] from the first (pattern) sublexicon, a consonant [X] from the second (root) sublexicon, and no symbol from the third (vocalism) sublexicon to a surface [X]. Rule R2 sanctions vowels in a similar manner. The obligatory rule R3 deletes the first vowel of */katab/ in the given context. The mapping is illustrated in Figure 4(a). The numbers between the surface and lexical expressions indicate the rules in Figure 3 that sanction the shown subsequences. Empty slots represent the empty string ε.</Paragraph>
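A minimal sketch of the combined effect of R1 and R2 on this example (R3's deletion and spreading are ignored; the function is illustrative, not the paper's compiler):

```python
def derive(pattern, root, vocalism):
    # walk the pattern tier: each 'c' sanctions the next root consonant (R1),
    # each 'v' the next vocalism vowel (R2); the tiers are consumed in step
    consonants = iter(root)
    vowels = iter(vocalism)
    return "".join(next(consonants) if seg == "c" else next(vowels)
                   for seg in pattern)
```

Here derive("cvcvc", "ktb", "aa") yields the underlying form */katab/, to which the deletion rule R3 then applies.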
    <Paragraph position="8"> As stated above, morphemes that do not conform to the root-and-pattern nature of Semitic (e.g., prefixes, suffixes, particles) are given in the first sublexicon. The identity rule:</Paragraph>
    <Paragraph position="10"> maps such morphemes to the surface. The rule basically states that any symbol not in {c, v} from the first sublexicon may optionally surface. [Table of prefixes: Sing. masc. he-; Sing. fem. te-; Pl. masc. ne-...n; Pl. fem. te-...an.] Figure 4(b) illustrates the analysis of /?etkatab/ from the morphemes given earlier.</Paragraph>
    <Section position="1" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
3.3 The Morphotactics Component
</SectionTitle>
      <Paragraph position="0"> Semitic morphotactics is divided into two categories: Templatic morphotactics occurs when the pattern, root, vocalism, and possibly other morphemes, join together in a nonlinear manner to form a stem. Non-templatic morphotactics takes place when the stem is combined with other morphemes to form larger morphological or syntactic units. The latter is divided in turn into two types: linear nontemplatic morphotactics, which makes use of simple prefixation and suffixation, and nonlinear nontemplatic morphotactics, which makes use of circumfixation.</Paragraph>
      <Paragraph position="1"> Templatic morphotactics is handled implicitly by the rewrite rules component. For example, the rules in Figure 3 implicitly dictate the manner in which pattern, root, and vocalism morphemes combine. Hence, the morphotactic component need not worry about templatic morphotactics.</Paragraph>
      <Paragraph position="2"> Linear nontemplatic morphotactics is handled via regular operations, usually n-way concatenation (Kaplan and Kay 1994) in the multitiered case. Consider for example Syriac /?etktab/ and its lexical analysis in Figure 4(b). The lexical analysis of the prefix is ⟨?et, ε, ε⟩ and that of the stem is ⟨cvcvc, ktb, aa⟩. Their n-way concatenation gives the tuple ⟨?et cvcvc, ktb, aa⟩. One may also use the "continuation classes" paradigm familiar from traditional two-level systems (Koskenniemi 1983, inter alia), in which lexical elements on each sublexicon are marked with the set of morpheme classes that can follow on the same sublexicon.</Paragraph>
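At the tuple level, n-way concatenation is simply componentwise concatenation, tape by tape (a sketch; the paper performs the operation on automata):

```python
def nway_concat(t1, t2):
    # concatenate two same-arity string tuples tape by tape
    assert len(t1) == len(t2)
    return tuple(a + b for a, b in zip(t1, t2))
```

For the example above, concatenating the prefix tuple with the stem tuple yields the analysis of the whole word.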
      <Paragraph position="3"> The last case is that of nonlinear nontemplatic morphotactics. Normally this arises in circumfixation operations. The following morphotactic rule formalism is used to describe such operations:  Unlike traditional finite-state methods in morphology that employ two-tape transducers, the proposed multitiered model requires multitape transducers. The algorithms for compiling the three components into such machines are given next.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="13121" end_page="13121" type="metho">
    <SectionTitle>
4. Algorithms for Compilation into Multitape Automata
</SectionTitle>
    <Paragraph position="0"> Multitape finite-state machines were first introduced by Rabin and Scott (1959), and Elgot and Mezei (1965). An n-tape finite-state automaton (FSA) is a 5-tuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ: Q × (Σ ∪ {ε})^n → 2^Q is a transition function (where ε is the empty string), q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states. An n-tape FSA accepts an n-tuple of strings if and only if, starting from the initial state q0, it can simultaneously scan all the symbols on every tape i, 1 ≤ i ≤ n, and end up in a final state q ∈ F. An n-tape finite-state transducer (FST) is simply an n-tape finite-state automaton with each tape marked as to whether it belongs to the domain or range of the transduction. In addition to common operators under which finite machines are closed, the algorithms discussed below make use of the following three operators: Definition. Let L be a regular language. Id_n(L) = {X | X is an n-tuple of the form (x, ..., x), x ∈ L} is the n-way identity of L.</Paragraph>
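The simultaneous-scan acceptance condition can be sketched as a search over configurations of automaton state plus per-tape positions. The encoding below (labels as n-tuples whose components are single symbols or the empty string) is illustrative, not the paper's implementation.

```python
def fsa_accepts(tapes, delta, q0, finals):
    # delta maps (state, label) to a set of successor states, where label is
    # an n-tuple over the alphabet plus "" (epsilon); each tape advances by
    # the length of its component of the label, so all heads move in step
    frontier = [(q0, (0,) * len(tapes))]
    seen = set()
    while frontier:
        state, pos = frontier.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        # accept when a final state is reached with every tape exhausted
        if state in finals and all(p == len(t) for p, t in zip(pos, tapes)):
            return True
        for (q, label), targets in delta.items():
            if q == state and all(t.startswith(s, p)
                                  for t, s, p in zip(tapes, label, pos)):
                new_pos = tuple(p + len(s) for p, s in zip(pos, label))
                frontier.extend((q2, new_pos) for q2 in targets)
    return False
```

A two-tape identity machine over {a, b}, for example, accepts ("ab", "ab") but rejects ("ab", "ba").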
    <Paragraph position="1"> Definition. Let R be a regular relation over the alphabet Σ, and let m be a set of symbols not necessarily in Σ. Insert_m(R) inserts the relation Id_n(a), for all a ∈ m, freely throughout R. The identity and insert operators are the n-tape versions of their counterparts in Kaplan and Kay (1994). 3 Definition. Let S and S′ be same-length n-tuples of strings over some alphabet Σ, let I = Id_n(a) for some a ∈ Σ, and let S = S1 I S2 I ... Sk, k ≥ 1, such that Si does not contain I; i.e., Si ∈ (Σ^n − {I})*. We say that Substitute_(S′,I)(S) = S1 S′ S2 S′ ... Sk substitutes every occurrence of I in S with S′.</Paragraph>
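The identity and substitute operators can be sketched over explicit sequences of n-tuples (set-level stand-ins for the automaton constructions; names are illustrative):

```python
def id_n(a, n):
    # Id_n(a): the n-tuple (a, ..., a)
    return (a,) * n

def substitute(seq, replacement, marker):
    # Substitute: replace every occurrence of the placeholder tuple `marker`
    # (an Id_n of an auxiliary symbol) in `seq` by the sequence `replacement`
    out = []
    for t in seq:
        out.extend(replacement if t == marker else [t])
    return out
```

This is exactly the mechanism used later to reintroduce a rule center in place of the auxiliary placeholder symbol.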
    <Section position="1" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
4.1 Building a Multitape Lexicon
</SectionTitle>
      <Paragraph position="0"> The compilation process builds a one-tape automaton for each sublexicon. The sublexica are then put together using the cross product operator with the effect that the resulting machine accepts entries from the ith sublexicon on its ith tape.</Paragraph>
      <Paragraph position="1"> Representing a lexical entry W in an automaton is achieved by concatenating the symbols of W one after the other. Now let L_i = {W1, W2, ...} be the set of lexical entries in the ith sublexicon. The expression for the ith sublexicon becomes:</Paragraph>
      <Paragraph position="3"> (Daciuk et al. [2000] give a more sophisticated incremental algorithm for compiling acyclic lexica.) The overall lexicon can then be expressed by taking the cross product of all the sublexica. To make the final lexicon accept same-length tuples, we insert 0s throughout,</Paragraph>
      <Paragraph position="5"> (2) All invalid tuples resulting from the cross product operation (e.g., (0, 0, ..., 0)) are removed by the intersection with π*, where π is the set of all feasible tuples computed from the rules (see Section 4.2). By way of illustration, Figure 5 gives the lexicon for the pattern {cvcvc}, the roots {ktb}, {pnq}, {qrb}, and {prq}, and the vocalism {ae}.</Paragraph>
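At the level of finite string sets, expressions (1) and (2) amount to a cross product with 0-padding. In this sketch entries are naively right-padded; in the actual algorithm the positions of the 0s are dictated by the feasible tuples π.

```python
from itertools import product

def multitape_lexicon(sublexica):
    # cross product of the sublexica; each combination is padded with the
    # auxiliary symbol '0' so the resulting tuple is same-length, and the
    # ith component is an entry of the ith sublexicon
    lexicon = set()
    for combo in product(*sublexica):
        width = max(len(entry) for entry in combo)
        lexicon.add(tuple(entry.ljust(width, "0") for entry in combo))
    return lexicon
```

For the pattern {cvcvc}, roots {ktb} and {pnq}, and vocalism {aa}, this produces one same-length tuple per root.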
    </Section>
    <Section position="2" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
4.2 Compiling the Rewrite Rules Component
</SectionTitle>
      <Paragraph position="0"> The algorithm for compiling rewrite rules is based on collaborative work by the author with E. Grimley-Evans and S. Pulman (Grimley-Evans, Kiraz, and Pulman 1996). The compilation process is preceded by a preprocessing stage during which all mappings of unequal lengths are made same-length mappings by inserting a special symbol, 0, when necessary. (The grammar writer need not worry about this special symbol, but cannot use it in the grammar.) This is necessary because ε-containing transducers are not closed under intersection and subtraction. Additionally, during preprocessing the following sets are computed: the set of all feasible tuples sanctioned by the grammar, π (used in expression (2) above); and the set of feasible surface symbols, π_s (to be used in expression (23) in Appendix A).</Paragraph>
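The same-length preprocessing can be sketched as follows. The sketch naively right-pads; in the real preprocessor the alignment of the 0s follows from the rules themselves.

```python
def same_length(surface, lexical):
    # pad the shorter side of a lexical-surface mapping with the special
    # symbol '0', which is hidden from the grammar writer
    width = max(len(surface), len(lexical))
    return surface.ljust(width, "0"), lexical.ljust(width, "0")
```

For instance, the unequal mapping of surface /ktab/ to lexical */katab/ becomes a same-length pair.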
      <Paragraph position="1"> The actual compiler takes as its input rules that have been preprocessed. The algorithm is subtractive in nature: it starts off by creating an automaton that accepts sequences of feasible tuples that are sanctioned by rules regardless of context, then starts subtracting strings which violate the rules. This subtractive approach was first suggested by E. Grimley-Evans.</Paragraph>
      <Paragraph position="2"> [Figure 6: a four-tape machine for accepting centers. The symbol s denotes the partition symbol σ. The surface symbol appears to the left of ":" and the lexical tuple to its right.] A lexical string maps to a surface string if and only if they can be partitioned into pairs of lexical-surface subsequences, where (i) each pair is licensed by an optional rule, and (ii) no sequence of zero or more adjacent pairs violates an obligatory rule.</Paragraph>
      <Paragraph position="3"> Let c = s:⟨l1, l2, ...⟩ be the center of a rule, where s is the surface form and the li are the lexical forms, and let C be the set of all such centers in the grammar. Further, let σ be a special symbol (not in the grammar's alphabet) denoting a subsequence boundary within a partition, and let σ′ = Id_n(σ). The automaton that accepts the centers of the grammar is described by the relation</Paragraph>
      <Paragraph position="5"> Centers accepts any sequence of the centers described by the grammar (each center surrounded by σ′) irrespective of their contexts. Assuming that /ktab/ is under consideration, Figure 6 gives the four-tape machine for the centers of the rules from Figure 3. The next step subtracts sequences whose contexts violate the grammar. For each center c ∈ C in the entire grammar, let LR_c = {(λ1, ρ1), (λ2, ρ2), ...} be the set of valid left and right context pairs for that center. The invalid contexts for c are expressed by:</Paragraph>
      <Paragraph position="7"> The first component of expression (4) gives all the possible contexts for c. The second component gives all the valid contexts for c. The subtraction results in all the invalid contexts for c. However, since σ′ appears freely in expression (3), it needs to be introduced in expression (4) as well, resulting in the relation Restrict of expression (5). The relation in expression (5) works only if the center consists of just one tuple. In order to allow it to be a sequence of tuples, c must be surrounded by σ′ on both sides to mark it as one subsequence; it must also be devoid of any σ′. [Figure 7: the machine from Figure 6 is repeated here after processing the rules in Figure 3.] The first condition is accomplished by simply placing σ′ to the left and right of c. As for the second condition, an auxiliary symbol ω is used as a placeholder representing c in order to avoid inserting σ′ within the tuples of c by Insert. Hence, we first introduce σ′ freely using Insert, then substitute c back in place of ω,</Paragraph>
      <Paragraph position="9"> where ω′ = Id_n(ω). Finally, for each c we subtract all such invalid relations from Centers, yielding the relation,</Paragraph>
      <Paragraph position="11"> ValidContexts now accepts all the sequences of tuples described by the grammar based on their contexts; however, it does not enforce obligatory rules. Figure 7 gives the machine after the center of R3 from Figure 3 has been processed.</Paragraph>
      <Paragraph position="12"> 4.2.3 Obligatory Rules. For each obligatory rule, let C represent the center c with the correct lexical expressions and the incorrect surface expression. The following relation describes all sequences of tuples that contain an unlicensed segment:</Paragraph>
      <Paragraph position="14"> The two σ′s surrounding C ensure that obligatoriness applies to at least one lexical-surface subsequence. The Insert operator inserts additional σ′s throughout the contexts and the center. The insertion of σ′ through the center allows Coerce to apply to a series of lexical-surface subsequences. To handle the case of epenthetic rules, one needs to allow Coerce to apply to zero subsequences as well. In such a case, one takes the union of expression (8) with Insert_{σ}(λσ′ρ), i.e., the empty subsequence.</Paragraph>
      <Paragraph position="15"> Finally, we subtract Coerce from the ValidContexts relation, yielding the relation:</Paragraph>
      <Paragraph position="17"> The relation accepts all and only the sequences of tuples described by the grammar.</Paragraph>
      <Paragraph position="18"> Figure 8 gives the machine after processing the obligatoriness of rule R3. The last step in compiling rules is to remove all instances of the symbol σ and the symbol 0.</Paragraph>
    </Section>
    <Section position="3" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
4.3 Compiling the Morphotactic Component
</SectionTitle>
      <Paragraph position="0"> The only class of morphotactics that is of interest here is nonlinear nontemplatic circumfixation per the formalism in Section 3.3. Let (P, S) = {(p1, s1), (p2, s2), ..., (pn, sn)} be a set of circumfixes and let B be the domain of the circumfixation operation. A rule A → P B S is compiled into an automaton by the expression in (10):</Paragraph>
      <Paragraph position="2"> An equivalent approach to handling circumfixation is to follow the subtractive approach used in compiling the rewrite rules component above by first overgenerating and then subtracting all invalid forms. The two approaches, union or subtraction, are formally equivalent in that they result in the same machine. Compilation using the union approach, however, is more efficient in terms of time complexity than the subtractive approach. The latter requires invoking negation and intersection algorithms, both of which are computationally expensive. It must be noted that the union approach requires a great deal of care in creating a machine that accepts only grammatical forms with no overgeneration.</Paragraph>
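The union approach can be sketched over finite sets; the automaton construction unions one branch per circumfix pair, avoiding negation and intersection entirely. The forms below are hypothetical illustrations, not attested paradigms.

```python
def circumfix(pairs, bases):
    # A -> P B S: for each circumfix pair (p, s) and each base b in the
    # domain B, accept exactly p + b + s; the union over pairs gives the
    # whole relation without overgeneration
    return {p + b + s for p, s in pairs for b in bases}
```

For example, a single circumfix pair ("ne", "un") applied to a base "ktob" yields only the one grammatical form.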
      <Paragraph position="3"> Beesley (1998c) employed a similar approach for eliminating invalid forms in Arabic long-distance dependencies by means of composition.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="13121" end_page="13121" type="metho">
    <SectionTitle>
5. Developing Semitic Grammars
</SectionTitle>
    <Paragraph position="0"> When developing Semitic grammars, various issues and problems arise that normally do not occur with linear grammars. This section aims at pointing out some of these issues.</Paragraph>
    <Section position="1" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
5.1 Handling the Nonlinear Stem
</SectionTitle>
      <Paragraph position="0"> The example lexica and rules in Sections 3.1 and 3.2 demonstrate how the CV-based templates are implemented in the current framework. Further, spreading and gemination, which tend to cause difficulties in other computational frameworks (see Section 6), are represented easily in the multitiered model without having to resort to ad hoc notation. [Computational Linguistics, Volume 26, Number 1] For example, the following rules, which use prosodic templates, demonstrate the intrasyllabic spreading of vowels:
Intrasyllabic spreading: ⟨σμμ, C, V⟩ ⇒ CVV
Extrametricality: ⟨σμ, C, ε⟩ ⇒ C
The symbol σμμ above is a templatic segment denoting a bimoraic syllable. The rule states that when σμμ appears on the pattern tape, a consonant C and a vowel V are read from the root and vocalism tapes, respectively; the corresponding surface segments are CVV. In conjunction with the extrametricality rule above, one obtains surface-lexical tuples like 3aamuus:⟨σμμ σμμ σμ, 3ms, au⟩, where the element to the left of ":" is the surface form and the tuple to its right represents the lexical forms (see Figure 2(d)). Gemination is handled in one of two ways. The first marks pattern consonantal segments with a subscript (e.g., c1vc2vc3) and provides rules to geminate the appropriate consonant, e.g.,
Gemination in /kattab/: ⟨c2, X, ε⟩ ⇔ XX
The second approach leaves pattern segments unmarked, but provides the proper left and right contexts in rules. The former approach requires more annotations in lexica and rules, but provides for smaller machines since no context expressions are used. The opposite holds for the latter approach. Both, however, require rule features (see Section 5.2) to ensure that the rule applies only to the desired measures. As the multitiered framework does not put any limitations on the number of lexical tapes, the grammar writer may also choose to place affixes on their own autonomous tape.
Hence, one produces surface-lexical tuples like ?akateb:⟨?a, cvcvc, ktb, ae⟩ (see Figure 1(c)).</Paragraph>
    </Section>
    <Section position="2" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
5.2 Using Features to Handle Idiosyncrasies
</SectionTitle>
      <Paragraph position="0"> Various lexical and morphotactic constraints exist in Semitic. For instance, roots do not appear in all verbal measures; rather, each root occurs in the literature in a subset of the measures, e.g., {pnq} does not exist in Measure 1 (one gets such information from dictionaries and corpora). Another example is that the vocalisms of Measure 1 in almost all Semitic languages are lexically marked for each root. In the current framework, categories of the form:</Paragraph>
      <Paragraph position="2"> 1992). Here, a value can be either an atom or a set of atoms. For example, the set of measures in which a root occurs is given in the attribute MEASURE below:</Paragraph>
      <Paragraph position="4"> The vocalisms of Measure 1 for various roots are marked with the attribute VOWEL. Hence, one gets [a] in /ktab/, but [e] in /qreb/.</Paragraph>
      <Paragraph position="5"> Categories are also used in rules. For example, the gemination rule above is associated with a category that indicates the measures in which gemination is valid. The definition of rule obligatoriness is extended to include categories (Pulman and Hepple 1993). The categories are incorporated in the automata compilation process following the algorithms in Kiraz (1997a).</Paragraph>
    </Section>
    <Section position="3" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
5.3 Linear versus Nonlinear Grammars
</SectionTitle>
      <Paragraph position="0"> Considering that nonlinearity in Semitic occurs mainly in the stem, maintaining a nonlinear lexical representation in rewrite rules causes rules that describe one phonological/orthographic phenomenon to be duplicated. This becomes a challenge to the grammar writer since Semitic employs very rich phonological rules: assimilation, dissimilation, prosthesis, anaptyxis, syncope, haplology, etc. (Moscati et al. 1969, Section 9.1 ff.).</Paragraph>
      <Paragraph position="1"> Consider the derivation of Syriac /ktab/ using the rules in Figure 3. Since vowel deletion in Syriac applies right-to-left, when adding the object pronominal suffix {eh} 'MASC 3RD SING', the second vowel should be deleted: */katabeh/ → /katbeh/. By virtue of its right lexical context, however, R3 in Figure 3 can only apply to the first stem vowel. Another rule (R4 below) is required for deriving /katbeh/ from */katab/ and the suffix {eh}, where the second stem vowel is deleted.</Paragraph>
      <Paragraph position="3"> where V is a vowel.</Paragraph>
      <Paragraph position="4"> The c in the right lexical context is a concrete symbol from a pattern morpheme, while V represents the class of all vowels.</Paragraph>
      <Paragraph position="5"> This does not resolve the problem. Both R3 and R4 fail when the deleted vowel itself appears in the prefix, e.g., /wakatbeh/ → /wkatbeh/ (with the prefix {wa}), requiring another rule. An additional rule is also needed to delete prefix vowels when the right context belongs to a (possibly different) linear prefix, e.g., prefixing the sequence {wa} 'and', {la} 'to', and {da} 'of' to the stem /katab/ gives /waldaktab/ (the [a] of {la} and the first stem vowel are deleted).</Paragraph>
      <Paragraph position="6"> The above examples clearly illustrate the proliferation that would result. Considering that such phonological rules do not depend on the nonlinear lexical structure of the stem, a better approach divides the lexical-surface mappings into two separate problems. The first handles the templatic nature of morphology, mapping the multiple lexical representation into a linearized lexical form, somewhat corresponding to McCarthy's notion of tier conflation (McCarthy 1986). Linearization of autosegmental representations in general has been suggested earlier by Kornai (1991, 1995).</Paragraph>
      <Paragraph position="7"> The second takes care of phonological/orthographic/graphemic mappings between the linearized lexical form and the actual surface. The entire grammar is taken as the composition of two sets of rules (Karttunen, Kaplan, and Zaenen 1992). Composition, however, needs to be defined for multitape machines. First, we redefine an n-tape finite-state machine as (Q, Σ, δ, q0, F, d), where the first five elements are as before and d, 1 ≤ d ≤ n, is the number of domain tapes (the number of range tapes is simply n − d).</Paragraph>
      <Paragraph position="8"> There is a composition of A and B, denoted by C, if and only if d2 = n1 − d1, with</Paragraph>
      <Paragraph position="10"> if and only if s_{d1+1} = s′_1, ..., s_{n1} = s′_{d2}.</Paragraph>
      <Paragraph position="11"> The resulting machine is a k-tape machine, where k = d1 + n2 − d2. In our implementation (see Section 7 and Appendix A), the domain and range tapes are given as arguments to the composition function, rather than being coded in the machines, in order to allow for flexibility in using machines.</Paragraph>
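The composition condition can be sketched over finite tuple relations: A's range tapes must match B's domain tapes, and the result keeps A's domain tapes together with B's range tapes (names and the worked tuples are illustrative):

```python
def compose(A, dA, B, dB):
    # A is a set of nA-tuples with dA domain tapes; B a set of nB-tuples
    # with dB domain tapes, where dB equals nA - dA. Keep a's domain tapes
    # and b's range tapes whenever a's range tapes equal b's domain tapes.
    return {a[:dA] + b[dB:]
            for a in A for b in B
            if a[dA:] == b[:dB]}
```

This mirrors the two-stage design above: a templatic stage mapping three lexical tapes to a linearized form, composed with a stage mapping the linearized form to the surface.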
    </Section>
    <Section position="4" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
5.4 Vocalization
</SectionTitle>
      <Paragraph position="0"> Semitic texts appear in three forms: consonantal texts, which do not incorporate any vowels but matres lectionis; partially vocalized texts, which incorporate some vowels to clarify ambiguity; and vocalized texts, which incorporate full vocalization.</Paragraph>
      <Paragraph position="1"> Handling all such forms is resolved in line with the previous discussion on linearization. The grammar writer should assume full vocalization when writing grammars. This will not only get rid of the duplicated rules for the same phonological/orthographic phenomenon, but will also make understanding and debugging rules an easier task. Once a lexical-surface rewrite rules system has been achieved, a set of rules that optionally delete vowel segments are specified and composed with the entire system.</Paragraph>
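The effect of composing with optional vowel-deletion rules can be sketched over plain strings (a set-level stand-in for the automaton composition; the vowel inventory is an assumption of the sketch):

```python
def devocalizations(word, vowels="aeiou"):
    # every spelling obtainable by optionally deleting vowel segments,
    # from the fully vocalized form down to the consonantal skeleton
    spellings = {""}
    for ch in word:
        extended = {s + ch for s in spellings}
        # a vowel may optionally fail to surface; consonants always surface
        spellings = extended | spellings if ch in vowels else extended
    return spellings
```

A fully vocalized /katab/ thus yields the vocalized, partially vocalized, and consonantal spellings.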
      <Paragraph position="2">  6. Other Approaches to Finite-State Semitic Morphology</Paragraph>
    </Section>
    <Section position="5" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
6.1 Kay's Multitape Approach
</SectionTitle>
      <Paragraph position="0"> Kay (1987) proposed handling the autosegmental analysis of Arabic by means of multitape automata. Kay adds some extensions to traditional FSTs. Transitions are marked with quadruples of elements (for vocalism, root, pattern, and surface form, respectively), where each element is a pair: a symbol and an instruction concerning the movement of the tape's head. Kay uses the following notation: An unadorned symbol is read and the tape's head moves to the next position. A symbol in brackets, [ ], is read and the tape's head remains stationary. A symbol in braces, { }, is read and the tape's head moves only if the symbol is the last one on the tape.</Paragraph>
      <Paragraph position="1"> The transitions for the analysis of Syriac /kattab/ 'to write--CAUSATIVE, PASSIVE', excluding the reflexive prefix {?et}, are shown in Figure 9. After the first transition on the quadruple {[ ], k, C, k} in Figure 9(a): no symbol is read from the vocalism tape, [k] is read from the root tape and the tape's head is moved, [C] is read from the pattern tape and the tape's head is moved, and [k] is written on the surface tape and the tape's head is moved. At the final configuration, all the tapes have been exhausted.</Paragraph>
      <Paragraph position="2"> Kay makes use of a special symbol, G, to handle gemination; when read, a symbol from the root tape is scanned without advancing the read head of that tape.</Paragraph>
      <Paragraph position="3"> The model suffers from a number of shortcomings, some of which have already been pointed out (Bird and Ellison 1992, Section 5.1). Firstly, the use of various bracketing notations to control the moves of the machine head(s) causes the read heads of the three upper input tapes to move independently of each other; this puts the expressiveness of the device into question: Bird and Ellison (1992, but not 1994) questioned the formal power of the device. Wiebe (1992), citing formal results from Fischer (1965),</Paragraph>
      <Paragraph position="5"> [Figure 9: Kay's analysis of Syriac /kattab/. The four tapes are (from top to bottom): vocalism tape, root tape, pattern tape, and surface tape. Transition quadruples are shown at the right side of the tapes. The symbol "↑" between the lower surface tape and the lexical tapes indicates the current symbols under the read/write heads.]</Paragraph>
      <Paragraph position="6"> stated that Kay's machine goes beyond finite-state power. No consensus has been reached on the matter to the best of the author's knowledge. In contrast, our proposed n-tape machines move all the read heads simultaneously ensuring finite-state expressive power. Secondly, the introduction of ad hoc symbols to templates (e.g., G for gemination) moves away from the spirit of association in autosegmental phonology that Kay wanted to model; other special symbols must also be added to completely implement the rest of the paradigm in question.</Paragraph>
      <Paragraph position="7"> We have demonstrated, however, the usefulness of Kay's proposal. Indeed, if one eliminates the ad hoc controls of the read head(s) and provides a rule formalism from which machines can be compiled algorithmically, the multitape model is quite adequate for describing autosegmental representations.</Paragraph>
    </Section>
    <Section position="6" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
6.2 The Intersection of Lexica Approach
</SectionTitle>
      <Paragraph position="0"> Kataja and Koskenniemi (1988), working on Akkadian, developed a system under traditional two-level morphology. It was mentioned earlier (see Section 1.1) that the challenge of handling Semitic morphology within traditional two-level morphology is that the lexical level is not merely the concatenation of the morphemes in question.</Paragraph>
      <Paragraph position="1"> Kataja and Koskenniemi resolved this by devising a &amp;quot;lexicon component&amp;quot; that makes use of two lexica: one for roots and the other for stem patterns and affixes.</Paragraph>
      <Paragraph position="2"> Entries in the former leave affix elements unspecified, while entries in the latter leave root elements unspecified. For example, the lexical entry for the Arabic root morpheme {ktb} takes the form Σp* k Σp* t Σp* b Σp* (11) where Σp is the alphabet of nonroot segments; likewise, the entry for the perfect passive vocalism {ui} takes the form: Σr* u Σr* i Σr* (12) where Σr is the alphabet of root segments. The intersection of both expressions allows for the well-formed string /kutib/, as well as numerous ill-formed sequences such as */ktbui/, inter alia. Under this framework, the result of the intersection becomes the lexical level of a traditional two-level morphology system. The two-level system then takes care of other morphological, phonological, and orthographic rules, all of which are linear in nature. Kataja and Koskenniemi suggested simulating the intersection by having the lexical lookup explore both lexica simultaneously.</Paragraph>
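The intersection can be simulated with regular expressions over a toy disjoint alphabet (here Σp = {u, i} and Σr = {k, t, b}; the encoding is an illustration, not Kataja and Koskenniemi's system):

```python
import re

# expression (11) for the root {ktb}: nonroot segments may intervene freely
root_entry = re.compile(r"[ui]*k[ui]*t[ui]*b[ui]*")
# expression (12) for the vocalism {ui}: root segments may intervene freely
vocalism_entry = re.compile(r"[ktb]*u[ktb]*i[ktb]*")

def in_intersection(s):
    # a string is in the intersection when both entries accept it
    return bool(root_entry.fullmatch(s) and vocalism_entry.fullmatch(s))
```

The well-formed /kutib/ is admitted, but so is the ill-formed */ktbui/, as noted above.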
      <Paragraph position="3"> The following computational shortcomings of the intersection approach come to mind. The intersection of the two lexica works only if Σp and Σr are disjoint. As this is not the case in Semitic, one has to introduce ad hoc symbols into the alphabet to make the two alphabets disjoint. Alternatively, Beesley (forthcoming) introduces an ingenious, but cumbersome, bracketing mechanism. Expression (11) above becomes:</Paragraph>
      <Paragraph position="5"> order to avoid confusion with set notation). Expression (12) then becomes: B* u B* i B* (14) where B = Σ − V, and V is the disjunction of all vowels. Finally, each measure is given by an expression; for instance, Arabic Form V (e.g., /takattab/, where the first [t] is an affix not related to the [t] of the root) is:</Paragraph>
      <Paragraph position="7"> (i.e., the disjunction of the root symbols surrounded by angle brackets). The symbol X in expression (15) indicates gemination in a way reminiscent of Kay's G symbol.</Paragraph>
      <Paragraph position="8"> The intersection of expressions (13), (14), and (15) results in /takatXab/ (X is dealt with by later rules). The disjunction of all such intersections results in what one may call a "quasi lexicon," i.e., the lexical side of subsequent two-level transducers that deal with linear phenomena (setting aside long-distance dependencies). Given r roots (approximately 4,000 in Modern Standard Arabic), v vocalisms, and p patterns (a few hundred for v × p, depending on the linguistic framework used), Beesley's bracketing algorithm needs to perform m intersections, where r ≤ m < r × v × p (since each root only intersects with lexically defined subsets of the patterns). In contrast, such a bracketing mechanism is not necessary in our multitape approach, since the alphabet of one tape does not interfere with the alphabets of other tapes. Further, our lexicon compiler needs to perform only n − 1 cross product operations (where n is the number of lexical tapes, usually 3). There is a substantial time complexity difference with practical effects. A faithful implementation of Beesley's bracketing approach and ours was performed using the Bell Labs Lextools compiler (Sproat 1995; Kiraz 1997b). (See Sproat [1997, Section 3.2] for a brief description of Lextools.) The test was also performed by M. Jansche using a neutral finite-state library (van Noord 1997) to ensure impartiality.</Paragraph>
      <Paragraph position="9"> Jansche was able to substantially enhance the performance of Beesley's method. The results of compiling various numbers of roots with the 24 Arabic verbal patterns appear in Table 3. The table indicates that for a full-scale system, the proposed multitier compilation method is far more efficient. Details of the tests appear in Appendix B.</Paragraph>
      <Paragraph position="10"> More serious is the fact that the bidirectionality of two-level morphology (i.e., morphemes mapping to surface forms and vice versa) is lost. Once intersection is performed, the result is an accepting automaton that represents stems rather than independent morphemes. In contrast, using our multitape model, the original morphemes (root, pattern, and vocalism) can be reconstructed from the multitape lexicon by a projection operation. Hence, projection, under which automata are closed, acts as the &amp;quot;reciprocal&amp;quot; operator for the cross product in expression (2), thus providing the means for bidirectionality. There is no such reciprocal operator for intersection: it is a destructive operator in the sense that its arguments cannot be recovered from the result.</Paragraph>
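The contrast between the destructive intersection and the recoverable cross product can be sketched as follows; the toy sublexicon entries and function names are mine, not the paper's:

```python
from itertools import product

# Toy sublexicons (assumed entries), one per lexical tape.
roots = {("k", "t", "b"), ("q", "t", "l")}
patterns = {("c", "v", "c", "v", "c")}
vocalisms = {("a", "a")}

# n - 1 cross products build the multitape lexicon: each entry is a
# tuple of tapes, with no interleaving of the alphabets.
multitape_lexicon = set(product(roots, patterns, vocalisms))

def projection(lexicon, tape):
    """Project the multitape lexicon onto one tape, recovering that
    morpheme inventory -- the 'reciprocal' of the cross product."""
    return {entry[tape] for entry in lexicon}

# The original morphemes are fully recoverable from the result,
# which is what an intersection-built stem lexicon cannot offer.
assert projection(multitape_lexicon, 0) == roots
assert projection(multitape_lexicon, 2) == vocalisms
```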
    </Section>
    <Section position="7" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
6.3 Beesley's Other &amp;quot;Intersection&amp;quot; Approach
</SectionTitle>
      <Paragraph position="0"> Beesley and his colleagues (Beesley, Buckwalter, and Newton 1989; Beesley 1990, 1991) developed a large-scale Arabic system under two-level morphology, which even has the ability to handle regional Egyptian spelling (Beesley, p. c.). The lexical lookup of the two-level model was augmented by a technique called &amp;quot;detouring&amp;quot; to access roots and affixes from different lexica (see Sproat \[1992, 163-165\] for details on &amp;quot;detouring&amp;quot;). In his 1996 paper, and subsequent work, Beesley reimplemented the system using the Xerox lexical and rule compilers (Karttunen 1993; Karttunen and Beesley 1992). An on-line demo of the reimplementation was also developed (Beesley 1998a). 4 Bidirectionality is maintained in Beesley's system by a direct mapping of each root and pattern pair to their respective surface realizations. The lexical description gives the root and pattern superficially concatenated in the form (Beesley 1996, p. c.):</Paragraph>
      <Paragraph position="2"> The square brackets are special symbols that delimit the stem, and &amp;quot;&amp;&amp;quot; is another special symbol that separates the root from the pattern; it is not the intersection operator.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="13121" end_page="13121" type="metho">
    <SectionTitle>
4 The new URL is http://www.xrce.xerox.com/research/mltt/arabic.
</SectionTitle>
    <Paragraph position="0"> Computational Linguistics Volume 26, Number 1 For each root and pattern pair, a rule of the following form is generated automatically: \[ktb&amp;CaCaC\] ~ katab (18) Each rule of the form in (18) is compiled into a transducer, which is then applied by composition to the identity transducer of the corresponding lexical description in (17). The result is a transducer that maps the string &amp;quot;\[ktb&amp;CaCaC\]&amp;quot; into &amp;quot;katab&amp;quot;. It is worth noting that rules of the form in (18) are reminiscent of Chomsky's early transformational rules for Hebrew stems (Chomsky 1951).</Paragraph>
    <Paragraph position="1"> (As &amp;quot;&amp;&amp;quot; in (17) and (18) is a concrete symbol, no real intersection, in the set-theoretic sense, takes place, though Beesley refers to this method, as well as to the bracketing mechanism described in the previous section, as &amp;quot;intersection&amp;quot;.) This method requires m (where r &lt;&lt; m &lt; r x v x p as before) rules of the form in (18) to be compiled into their respective transducers using algorithms of the Xerox replace operator (Karttunen 1997), literally thousands of rules. Additionally, the entire set of m transducers (or subsets, one subset at a time) needs to be put together into one (or more) transducer(s) by means of intersection (if the transducers are ε-free) or composition. Although this takes place during grammar development, rather than at run-time, the resulting inefficiency in the compilation process is apparent from the fact that a linguistic phenomenon (here, the linearization of stems) is conveyed by applying a rule to every single stem of the language. Just as one does not provide one rule to delete \[e\] in English /move+ing/ and another to delete the same in /charge+ing/, etc., but a single \[e\] deletion rule that applies throughout the entire language, so stems in Semitic ought to be realized by rules that represent the phenomenon itself, not every single instance of the phenomenon.</Paragraph>
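The argument can be made concrete with a sketch: a single generic interdigitation function expresses the linearization phenomenon once, whereas the per-stem approach must enumerate a rule like (18) for every root and pattern pair. The function name and the toy inventories are assumptions for illustration only.

```python
def interdigitate(root, pattern):
    """Fill each C slot of a CV pattern with the next radical; other
    pattern symbols (vowels) pass through unchanged. One function
    covers the linearization phenomenon for every stem."""
    radicals = iter(root)
    return "".join(next(radicals) if ch == "C" else ch for ch in pattern)

assert interdigitate("ktb", "CaCaC") == "katab"
assert interdigitate("qtl", "CuCiC") == "qutil"

# The per-stem approach, by contrast, compiles m separate rules of the
# form [root&pattern] -> stem, one for every pairing:
per_stem_rules = {f"[{r}&{p}]": interdigitate(r, p)
                  for r in ("ktb", "qtl") for p in ("CaCaC", "CuCiC")}
print(len(per_stem_rules))   # already 4 rules for 2 roots x 2 patterns
```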
    <Paragraph position="2"> In contrast, the proposed multitier model requires only three rules throughout the entire language to model Beesley's roots and patterns (i.e., with X to denote gemination and hard-coding vocalic spreading): R1 (for stem consonants) and R2 (for stem vowels) from Figure 3, in addition to the following gemination rule (see Section 5.1 for our handling of gemination and spreading): Gemination:</Paragraph>
    <Paragraph position="4"> The result of the three rules is a mere (|R| + 1)-state machine, where R is the set of all root segments (|R| = 28 for Arabic, 22 for Syriac), which is then applied to the multitiered lexicon. Figure 10 gives such a machine for R = { k,t,b } and the vowels { a,u,i }.</Paragraph>
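A minimal simulation of how R1, R2, and the gemination rule cooperate over the lexical tapes might look as follows; the pattern strings and the vowel-spreading convention are simplifications of the paper's rules, not their exact formulation:

```python
def realize(pattern, root, vocalism):
    """Surface a stem from three lexical tapes: 'c' consumes the next
    radical (R1), 'v' the next vowel, spreading the last one if the
    vocalism is exhausted (R2), 'X' geminates the most recent radical
    (the gemination rule); any other pattern symbol is an affix and
    passes through unchanged."""
    r, v = list(root), list(vocalism)
    surface, last_c = [], None
    for sym in pattern:
        if sym == "c":
            last_c = r.pop(0)
            surface.append(last_c)
        elif sym == "v":
            surface.append(v.pop(0) if len(v) > 1 else v[0])  # spreading
        elif sym == "X":
            surface.append(last_c)                            # gemination
        else:
            surface.append(sym)                               # affix
    return "".join(surface)

assert realize("cvcvc", "ktb", "a") == "katab"      # Form I
assert realize("cvcXvc", "ktb", "a") == "kattab"    # Form II, X doubles [t]
assert realize("tvcvcXvc", "ktb", "a") == "takattab"  # Form V, affixal t
```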
    <Paragraph position="5"> Another disadvantage, although a minor one, of rules of the form in (18) is the loss of alignment between surface segments and their lexical counterparts. While this does not affect the behavior of the resulting machines, having segments aligned helps in debugging at the grammar-design stage. A cursory look at the transitions in Figure 10 indicates to the grammar writer the lexical segments that correspond to a surface segment.</Paragraph>
    <Paragraph position="6"> Having said that, Beesley's system remains the largest reported Semitic grammar written within finite-state morphology to date. The system, however, relies on old linguistic models, as old as Harris (1941). No attempt has been reported to employ modern linguistic models such as the autosegmental framework and other developments mentioned in Section 1, although this seems to be the direction of modern research in computational Semitic morphology as well as linguistics (see the bibliographical entries cited in Section 1).</Paragraph>
    <Paragraph position="7">  Triangular prism demonstrating the autosegmental representation of Arabic/kattab/.</Paragraph>
    <Section position="1" start_page="13121" end_page="13121" type="sub_section">
      <SectionTitle>
6.4 Encoding Autosegmental Representations Approach
</SectionTitle>
      <Paragraph position="0"> There have been a number of proposals to encode autosegmental representations.</Paragraph>
      <Paragraph position="1"> Kornai (1991, 1995) gives a linear coding; Wiebe (1992) and Bird and Ellison (1992) give a multitiered encoding. We shall illustrate this approach from Bird and Ellison's work.</Paragraph>
      <Paragraph position="2"> Every pair of autosegmental tiers constitutes a chart (or plane). The representation of Arabic/kattab/, for example, takes the form of a triangular prism as in Figure 11 (Pulleyblank 1986). Each morpheme sits on one of the prism's three longitudinal edges: the pattern on edge 1-2, the vocalism on edge 3-4, and the root on edge 5-6. The prism has three longitudinal charts: pattern-vocalism (1-2-3-4), pattern-root (1-2-6-5), and root-vocalism (3-4-5-6). The corresponding encoding of the diagram is:  Each expression is an (n + 1)-tuple, where n is the number of charts. The first element in the tuple represents the autosegment. The positions of the remaining elements in the tuple indicate the chart in which an association line occurs, and the numerals indicate the number of association lines on that chart. For example, the expression a:2:0:0 states that the autosegment &amp;quot;a&amp;quot; has two association lines on the first pattern-vocalism chart, zero lines on the second pattern-root chart, and zero lines on the third root-vocalism chart.</Paragraph>
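A small sketch decoding such (n + 1)-tuples may clarify the encoding; the chart names follow the prism of Figure 11, and the function itself is my own illustration rather than Bird and Ellison's code:

```python
# Chart names taken from the prism's three longitudinal charts.
CHARTS = ("pattern-vocalism", "pattern-root", "root-vocalism")

def decode(expr, charts=CHARTS):
    """Split an (n + 1)-tuple such as 'a:2:0:0' into the autosegment
    and the number of association lines it carries on each chart."""
    fields = expr.split(":")
    segment, counts = fields[0], [int(c) for c in fields[1:]]
    return segment, dict(zip(charts, counts))

segment, lines = decode("a:2:0:0")
print(segment, lines)
# a {'pattern-vocalism': 2, 'pattern-root': 0, 'root-vocalism': 0}
```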
      <Paragraph position="3"> No implementation of a Semitic language, to the best of the author's knowledge, has been carried out using these methods. Bird and Ellison, however, give an example of how they envisage implementing Semitic using their framework. For each measure, they provide an expression that generalizes over that particular measure, e.g. for {CVCCVC}:</Paragraph>
      <Paragraph position="5"> The numbers after the colon refer to the autosegmental tier with which the segment is linked (0 for vowels and 1 for root segments). A second expression generalizes over all stems constructed from a particular root, e.g., for {ktb}:</Paragraph>
      <Paragraph position="7"> where &amp;quot;*&amp;quot; and &amp;quot;+&amp;quot; denote Kleene star and plus, respectively. A third expression describes the vocalism, e.g., for {a} with spreading: (a U C)* (21) The intersection of the three expressions, per their intersection algorithm for such encodings, yields: k:1 a:0 t:1 t:1 a:0 b:1 (22) Superficially, this approach may seem equivalent to the other intersection approaches mentioned in Section 6.2. The methodology here, however, is formally more appealing and linguistically more sound since it provides for a mechanism to describe autosegmental association. Bird and Ellison (1992, 87) question whether their approach will &amp;quot;cover all possible generalizations about Arabic verbal structure.&amp;quot; It would definitely be worth investigating how a higher-level autosegmental description of Semitic can be compiled algorithmically into their machines directly.</Paragraph>
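The intersection can be emulated by brute force over the indexed alphabet; the template and regular expression below are simplified reconstructions of expressions (19) and (20), not Bird and Ellison's exact encoding, and the vocalism constraint (21) is satisfied by construction:

```python
from itertools import product
import re

C = ["k:1", "t:1", "b:1"]   # root-linked segments (tier index 1)
V = ["a:0"]                 # vocalism-linked segments (tier index 0)

# (19) the measure {CVCCVC} over the indexed alphabet.
measure = [C, V, C, C, V, C]

# (20) the root: radicals k, t, b in order, each extendable by Kleene
# plus, with vowels freely interspersed (an assumed reconstruction).
root_re = re.compile(r"^(a:0 )*(k:1 )+(a:0 )*(t:1 )+(a:0 )*(b:1 )+(a:0 )*$")

stems = sorted(" ".join(seq)
               for seq in product(*measure)
               if root_re.match(" ".join(seq) + " "))
print(stems)
```

With these constraints the intersection is unique and reproduces expression (22), k:1 a:0 t:1 t:1 a:0 b:1.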
      <Paragraph position="8"> We mentioned above (Section 6.2) that the intersection approach lacks bidirectionality. It is possible, though this has not been tested, that the indices in Bird and Ellison's method could play a role in recovering the various morphemes of a particular surface form.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="13121" end_page="13121" type="metho">
    <SectionTitle>
7. Implementation
</SectionTitle>
    <Paragraph position="0"> There are two aspects of the implementation: implementing the theoretical model based on the algorithms presented in Section 4 and implementing Semitic grammars for the case study. As for the former, the algorithms in Section 4 were implemented by the author in SICStus Prolog. Details of the implementation are given in Appendix A.</Paragraph>
    <Paragraph position="1"> As for the grammars themselves, a small-scale Syriac grammar was implemented based on the 100 most frequent roots, including their numerous inflexions, of the Syriac New Testament (Kiraz 1994a). Care was taken, however, to ensure that most of the verbal and nominal classes of the language were exhaustively covered. Additionally, sample Arabic grammars--but with full coverage of the phenomena in question--have been implemented to test various linguistic models of Semitic, including: CV templates, moraic templates, affixational templatic morphology, prosodic circumscription, and broken plurals. A detailed description of handling these linguistic models appears elsewhere (Kiraz, in press).</Paragraph>
  </Section>
</Paper>