Linearization of Nonlinear Lexical Representations 
George Anton Kiraz 
Bell Laboratories 
700 Mountain Ave. 
Murray Hill, NJ 07974 
Email: gkiraz@research.bell-labs.com
Abstract 
This paper presents a new schema for handling nonlinear morphology.
The schema argues for linearizing nonlinear representations before
applying phonological and morphological rules.
1 Introduction and Problem Statement
Languages which exhibit templatic morphology have lately been
treated using multi-tape finite-state transducers, with one tape
representing surface forms and the remaining tapes representing
lexical forms (see (Kay, 1987; Kiraz, Forthcoming)). There are a
number of advantages to using this multi-tape model. Not only does
it accurately represent the linguistic insights behind the templatic
nonlinear nature of these languages, it also allows the computational
linguist to compile efficient, relatively small morphological lexica
as opposed to lexica containing millions of entries.
However, maintaining a nonlinear lexical representation has its own
inconveniences and computational complexities. Firstly, the writer of
multi-tape rules must keep track of multiple representations (four
in the case of Semitic as opposed to two for English), which makes
writing grammars an arduous task. Secondly, rules which describe one
phonological/orthographic phenomenon must be duplicated in order to
account for both the nonlinear nature of the stem and the linear
nature of the segments present in prefixes and suffixes. Thirdly, in
systems which require multiple sets of rules (say a text-to-phoneme
system with two sets of rules: surface-to-lexical and
lexical-to-phoneme), the above complexities multiply. Finally, there
is the issue of space complexity: although the space complexity for
transitions of an automaton with respect to the number of tapes is
linear, space can become costly for huge machines, especially those
whose number of transitions far exceeds the number of states, a
typical situation in natural language problems.
This paper provides a finite-state schema with which one can
maintain the nonlinear lexical representation in templatic
morphology, yet allow for a linear model for representing
phonological/orthographic and other script-related rules. Such rules
are in fact linear and need not be made complex on account of the
nonlinear templatic phenomenon of morphology.
2 Problems in Templatic 
Morphology 
2.1 Nonlinearity vs. Linearity 
Consider the infamous Arabic stem /katab/ 'to write - PERFECT
ACTIVE'. It is derived from the root morpheme {ktb} 'notion of
writing', the vocalism morpheme {a} 'PERFECT ACTIVE' and the rather
abstract pattern morpheme {CVCVC} 'VERB'. The latter describes the
interdigitation of the root and vocalism. Substituting the Cs with
the root consonants and the Vs with the vocalism vowels results in
the surface form /katab/. This process is illustrated along the
lines of (McCarthy, 1981), based on autosegmental phonology
(Goldsmith, 1976), as follows:

         a
       /   \
   C  V  C  V  C   = /katab/
   |     |     |
   k     t     b

Similarly, applying the same process to the root {ṣdq} 'notion of
truth' results in the verb /ṣadaq/ 'to speak the truth - PERFECT
ACTIVE'.
The stems /katab/ and /ṣadaq/ may be prefixed and suffixed to form
other words. Prefixation and suffixation, however, are linear
operations in Semitic. In other words, the lexical representation of
the prefixes and suffixes does not require multiple tapes. Hence, the
prefix {wa} 'and' is applied to the above stems to form /wakatab/
and /waṣadaq/, respectively.
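
The interdigitation and prefixation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `interdigitate` and `prefix` are hypothetical, the pattern is written with capital C and V slots as in {CVCVC}, and a vocalism shorter than the number of V slots is assumed to spread its last vowel over the remaining slots.

```python
def interdigitate(pattern: str, root: str, vocalism: str) -> str:
    """Fill C slots with root consonants and V slots with vocalism vowels."""
    consonants = iter(root)
    stem = []
    vowel_index = 0
    for slot in pattern:
        if slot == "C":
            stem.append(next(consonants))
        elif slot == "V":
            # Assumption: a short vocalism spreads its last vowel rightwards.
            stem.append(vocalism[min(vowel_index, len(vocalism) - 1)])
            vowel_index += 1
        else:
            stem.append(slot)
    return "".join(stem)


def prefix(affix: str, stem: str) -> str:
    """Prefixation is linear, so it is plain concatenation."""
    return affix + stem


stem = interdigitate("CVCVC", "ktb", "a")
print(stem)                # katab
print(prefix("wa", stem))  # wakatab
```

The contrast between the two functions mirrors the point of the section: the stem needs three separate inputs, whereas the affix needs only one.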
2.2 Phonological and Orthographic Rules 
Surface-to-lexical mappings must account for phono- 
logical and orthographic processes. In fact, for many 
languages, the phonological and orthographic rules 
tend to be more numerous than the morphological 
rules. This is the case in Semitic. For example, 
the Syriac grammar reported in (Kiraz, 1996) contains
48 rules. Only six rules (a mere 12.5%)¹ are
motivated by templatic morphology. The rest are 
phonological and orthographic. 
Consider the above derivation of /katab/, but for Syriac rather than
Arabic (both languages share the same morphemes in this case). Syriac
has the Vowel Deletion Rule

   V → ε / __ CV

where ε is the empty string. The rule states that short vowels in
open syllables are deleted. Hence, */katab/ → /ktab/. The rule
applies right-to-left; hence, when adding the object pronominal
suffix {eh} 'MASCULINE 3RD SINGULAR', the second vowel is deleted,
*/katabeh/ → /katbeh/.
Similarly, prefixing the above {wa} morpheme (which is also shared
by Syriac and Arabic) results in */wakatab/ → /waktab/ (first stem
vowel is deleted), and */wakatabeh/ → /wkatbeh/ (prefix vowel and
second stem vowel are deleted).
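
The right-to-left application of the rule can be checked with a small sketch that works on the linear (already derived) form. It is an illustration only: the name `delete_vowels` and the vowel inventory are assumptions, every vowel is treated as short, and the context CV is tested against the segments that survive to the right.

```python
VOWELS = set("aeiou")  # assumed inventory; all vowels are treated as short


def delete_vowels(form: str) -> str:
    """Apply V -> empty / __ CV, scanning the form from right to left."""
    kept = []  # surviving segments, stored in right-to-left order
    for seg in reversed(form):
        # kept[-1] is the next surviving segment to the right,
        # kept[-2] is the one after it.
        if (seg in VOWELS and len(kept) >= 2
                and kept[-1] not in VOWELS and kept[-2] in VOWELS):
            continue  # the vowel is in an open syllable: delete it
        kept.append(seg)
    return "".join(reversed(kept))


for lexical in ["katab", "katabeh", "wakatab", "wakatabeh"]:
    print(lexical, "->", delete_vowels(lexical))
```

Because a deleted vowel is never added to `kept`, a vowel further left sees the post-deletion context, which is what makes /wakatab/ surface as /waktab/ rather than losing both vowels.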
It is worth noting that such phonological rules 
do not depend on the nonlinear lexical structure 
of the stem. They actually apply on the morpho- 
logically derived stem. Semitic, then, maintains 
at least the following strata: lexical-morphological 
(where the lexical representation is nonlinear) and 
morphological-surface (where both representations 
are linear). 
2.3 Other Linguistic Representations 
So far we have looked at two linguistic representa- 
tions: lexical and surface (~ orthographic). Now 
consider a text-to-speech system which requires a 
phonological representation as well. 

¹ Had the grammar been more exhaustive, the percentage would be much
lower, since most additions to the rules would be in the domain of
phonology/orthography rather than templatic morphology.

In the Arabic example above, the first phoneme of /ṣadaq/ is emphatic
(denoted by the sublinear dot). This emphasis is spread at the
phonological level resulting in [ṣạḍaq] ([q] is already an emphatic
phoneme).² In this case, emphasis can be determined from the surface
(~ orthographic) form.
However, this is not always the case. Syriac spirantization requires
lexical information, as the following example illustrates.
Synchronically speaking, the six plosives [b], [g], [d], [k], [p] and
[t] undergo spirantization when in postvocalic position with respect
to the lexical form,³ resulting in [v], [ɣ], [ð], [x], [f] and [θ],
respectively. Hence, */katab/ → [kθav], and */wakatab/ → [waxθav]
(in both cases the first stem vowel is deleted as described above).
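
The dependence on the lexical form can be made concrete with a short sketch. It assumes a linear lexical string, a hypothetical helper per process (`spirantize`, `delete_vowels`, `phonological_form`), and the six plosive/spirant pairs listed above; spirantization is computed on the lexical form first, and vowel deletion is applied afterwards.

```python
VOWELS = set("aeiou")  # assumed inventory
SPIRANT = {"b": "v", "g": "ɣ", "d": "ð", "k": "x", "p": "f", "t": "θ"}


def spirantize(lexical):
    """Spirantize plosives that are postvocalic in the given form."""
    out = []
    for i, seg in enumerate(lexical):
        postvocalic = i > 0 and lexical[i - 1] in VOWELS
        out.append(SPIRANT[seg] if postvocalic and seg in SPIRANT else seg)
    return out


def delete_vowels(segments):
    """V -> empty / __ CV, applied right to left."""
    kept = []
    for seg in reversed(segments):
        if (seg in VOWELS and len(kept) >= 2
                and kept[-1] not in VOWELS and kept[-2] in VOWELS):
            continue
        kept.append(seg)
    return kept[::-1]


def phonological_form(lexical):
    return "".join(delete_vowels(spirantize(lexical)))


print(phonological_form("katab"))    # kθav
print(phonological_form("wakatab"))  # waxθav
print("".join(spirantize("ktab")))   # ktav: the surface form alone misses [θ]
```

The last line shows why the surface form is not enough: once the first vowel is gone, [t] no longer looks postvocalic.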
3 Multi-Tape Grammar 
This section provides a grammar for the above data using a
multi-tape model and illustrates some of the complexities involved in
maintaining multiple lexical tapes throughout. The multi-tape model
(originally proposed by (Kay, 1987)) is an extension of the commonly
used regular rewrite rules. In the multi-tape version, more than one
lexical tape is allowed. Here, we shall use the following formalism,
which derives from the one reported by (Pulman and Hepple, 1993), to
express regular rewrite rules:

   LLC - LEX  - RLC   {⇒, ⇔}
   LSC - SURF - RSC

where LLC is the left lexical context, LEX is the lexical form, RLC
is the right lexical context, LSC is the left surface context, SURF
is the surface form, and RSC is the right surface context. The
operators ⇒ and ⇔ indicate optional and obligatory rules,
respectively. In the multi-tape version, lexical expressions are
n-tuples of regular expressions of the form ⟨x1, x2, ..., xn⟩, with
the ith expression referring to symbols on the ith lexical tape. When
n = 1, the brackets can be omitted; hence, ⟨x⟩ and x are equivalent.⁴
The grammars presented here assume a lexicon with the morpheme
entries presented above. The pattern morpheme is {cvcvc} (in small
letters); capitals in rules denote variables drawn from a finite set
of symbols.
Lexical expressions make use of three tapes: pattern, root and
vocalism, respectively. Hence, the lexical expression ⟨c,k,ε⟩ denotes
a [c] on the first (pattern) tape, a [k] on the second (root) tape
and the empty string on the third (vocalism) tape. Prefixation and
suffixation, which for the most part fall outside the domain of
templatic morphology, are represented as a sequence of segments as in
any other language and are placed on the first (pattern) lexical
tape.⁵

Grammar 1: Grammar for Arabic /(wa)katab/ and /(wa)ṣadaq/

   R1   * - ⟨c,C,ε⟩ - *   ⇒
        * -    C    - *

   R2   * - ⟨v,ε,a⟩ - *   ⇒
        * -    a    - *

   R3   * - X - *   ⇒
        * - X - *

where X is any segment, C is a consonant, and * is any context.

Grammar 2: Grammar for the Syriac Vowel Deletion Rule

   R4   * - ⟨v,ε,a⟩ - ⟨cv,C,a⟩   ⇔
        * -    ε    - *

   R5   * - ⟨v,ε,a⟩ - ⟨cV,C,ε⟩   ⇔
        * -    ε    - *

   R6   * - a - ⟨cv,C,a⟩   ⇔
        * - ε - *

   R7   * - a - CV   ⇔
        * - ε - *

where C is a consonant and V is a vowel.

² The scope of emphasis is another challenging problem. Sometimes
emphasis spreads till the end of the current syllable, and sometimes
till the end of the word.
³ Diachronically speaking, early Aramaic idioms, of which Syriac is
one, did not apply the above vowel deletion rule; hence, in the New
Testament the first [a] in sabachthani (Mt 27:46) is retained. Later,
however, the vowel deletion rule took effect, but spirantized
consonants remained as if the deletion did not take place.
⁴ For compiling such rules into automata, see (Grimley-Evans, Kiraz,
and Pulman, 1996).
3.1 Nonlinearity vs. Linearity 
Rules R1 and R2 in Grammar 1 take care of consonants and vowels,
respectively. The rules derive the Arabic forms /katab/ and /ṣadaq/.
R3 is the default rule for prefixes and suffixes. It simply maps
every segment on the first lexical tape to the surface. Grammar 1
derives the forms /wakatab/ and /waṣadaq/ as well. The former is
illustrated below.

   Vocalism              |   |   |   | a |   | a |   |
   Root                  |   |   | k |   | t |   | b |
   Pattern and Affixes   | w | a | c | v | c | v | c |
   Rule                    3   3   1   2   1   2   1
   Surface               | w | a | k | a | t | a | b |

The numbers between the tapes refer to the rules in the grammar. Note
that the prefix shares a tape with the pattern.
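
The derivation above can be replayed with a minimal three-tape sketch. It is not the compiled automaton of the formalism, only a hand-written reading of the tapes; `apply_grammar1` is a hypothetical name, the vocalism tape is assumed to supply one vowel per v slot, and affix segments are assumed never to be spelled with the slot symbols c and v.

```python
def apply_grammar1(pattern_tape, root_tape, vocalism_tape):
    """Read the three lexical tapes and return (surface, rule trace)."""
    root = iter(root_tape)
    vocalism = iter(vocalism_tape)
    surface, trace = [], []
    for seg in pattern_tape:
        if seg == "c":    # R1: <c, C, empty> surfaces as the root consonant
            surface.append(next(root))
            trace.append(1)
        elif seg == "v":  # R2: <v, empty, a> surfaces as the vocalism vowel
            surface.append(next(vocalism))
            trace.append(2)
        else:             # R3: affix segments are copied to the surface
            surface.append(seg)
            trace.append(3)
    return "".join(surface), trace


print(apply_grammar1("wacvcvc", "ktb", "aa"))
# ('wakatab', [3, 3, 1, 2, 1, 2, 1])
```

The returned trace corresponds to the row of rule numbers in the illustration.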
3.2 Phonological and Orthographic Rules 
The Syriac vowel deletion rule, V → ε / __ CV, is given in the
notation of our formalism in Grammar 2. Note that by virtue of its
right-lexical context ⟨cv,C,a⟩, R4 can only apply to the first stem
vowel, as illustrated in the derivation of /ktab/ from the underlying
*/katab/ by the deletion of the first vowel:

   Vocalism   |   | a |   | a |   |
   Root       | k |   | t |   | b |
   Pattern    | c | v | c | v | c |
   Rule         1   4   1   2   1
   Surface    | k |   | t | a | b |

⁵ Having the prefixes share a tape with the patterns is a matter of
convenience, since the number of segments in a pattern corresponds,
more or less, to that on the surface more closely than do the
segments of roots and vocalisms.

Another rule (R5 in Grammar 2) is required for deriving
*/katab/ + {eh} → /katbeh/, where the second stem vowel is deleted by
the same phonological phenomenon. The difference here lies in the
right-lexical context expression ⟨cV,C,ε⟩, where the suffix vowel
appears on the first lexical tape. The derivation is illustrated
below:

   Vocalism              |   | a |   | a |   |   |   |
   Root                  | k |   | t |   | b |   |   |
   Pattern and Affixes   | c | v | c | v | c | e | h |
   Rule                    1   2   1   5   1   3   3
   Surface               | k | a | t |   | b | e | h |

R4 and R5 fail when the deleted vowel itself appears in the prefix,
e.g. {wa} + /katbeh/ → /wkatbeh/. R6 handles this case; here, the
right context ⟨cv,C,a⟩ belongs to the nonlinear stem as shown below:

   Vocalism              |   |   |   | a |   | a |   |   |   |
   Root                  |   |   | k |   | t |   | b |   |   |
   Pattern and Affixes   | w | a | c | v | c | v | c | e | h |
   Rule                    3   6   1   2   1   5   1   3   3
   Surface               | w |   | k | a | t |   | b | e | h |

In addition, R7 deletes prefix vowels when the right context belongs
to a (possibly another) linear prefix, e.g.,
{wa} + {la} + {da} + /katab/ → /waldaktab/ (the [a] of {la} and the
first stem vowel are deleted), as illustrated below:

   Vocalism              |   |   |   |   |   |   |   | a |   | a |   |
   Root                  |   |   |   |   |   |   | k |   | t |   | b |
   Pattern and Affixes   | w | a | l | a | d | a | c | v | c | v | c |
   Rule                    3   3   3   7   3   3   1   4   1   2   1
   Surface               | w | a | l |   | d | a | k |   | t | a | b |

The above examples clearly illustrate the com- 
plexity of maintaining large nonlinear grammars. 
4 Using a Linearized Lexical 
Representation 
This section argues that a better framework for solving Semitic
morphology divides the lexical-surface mappings into two separate
problems. The first handles the templatic nature of morphology,
mapping the multiple lexical representation into a linearized lexical
form. This linearized form maintains the same linguistic information
as the original lexical representation, and somewhat corresponds to
McCarthy's notion of tier conflation (McCarthy, 1986).
The second takes care of phonological/orthographic/graphemic
mappings between the linearized lexical form and the actual surface.
The combined machine is mathematically taken as the composition of
the two machines representing the two sets of rules. This brings us
to the question of composing multi-tape automata.
4.1 Composition of Multi-Tape Machines 
The composition of two binary transducers A and B 
is straightforward since one tape is taken for input 
and the other for output. The composition of the 
two machines is a generalization of the intersection 
of the same two automata in that each state in the 
resulting machine is a pair drawn from one state in 
A and the other from B, and each transition corre- 
sponds to a pair of transitions, one from A and the 
other from B, with compatible labels. 
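
The pair construction for binary transducers can be sketched as follows. This is a simplified illustration, not a full implementation: the dictionary representation and the names `compose` and `accepts` are assumptions, arcs carry exactly one input and one output symbol, and epsilon transitions are not handled.

```python
from itertools import product


def compose(a, b):
    """Compose two binary transducers given as dicts with 'start',
    'finals' and 'arcs'; each arc is (source, input, output, target)."""
    arcs = {
        ((p1, p2), x, z, (q1, q2))
        for (p1, x, y, q1), (p2, y2, z, q2) in product(a["arcs"], b["arcs"])
        if y == y2  # compatible labels: output of A matches input of B
    }
    return {
        "start": (a["start"], b["start"]),
        "finals": set(product(a["finals"], b["finals"])),
        "arcs": arcs,
    }


def accepts(t, inp, out):
    """Check an equal-length input/output pair against a transducer."""
    if len(inp) != len(out):
        return False
    states = {t["start"]}
    for x, z in zip(inp, out):
        states = {dst for (src, i, o, dst) in t["arcs"]
                  if src in states and i == x and o == z}
    return bool(states & t["finals"])


A = {"start": 0, "finals": {0}, "arcs": {(0, "a", "b", 0)}}  # a* : b*
B = {"start": 0, "finals": {0}, "arcs": {(0, "b", "c", 0)}}  # b* : c*
AB = compose(A, B)
print(accepts(AB, "aa", "cc"))  # True
print(accepts(AB, "aa", "cb"))  # False
```

Each state of the result is a pair of states and each arc is built from one arc of A and one arc of B, as in the description above.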
The composition of multi-tape transducers, however, is ambiguous.
Which tapes are input and which are output? Consider the machine
which accepts the regular relation⁶ a*:b*:b* and a second machine
which accepts the regular relation b*:b*:c*. The composition of the
two machines can be either the machine accepting a*:c* or the machine
accepting a*:b*:b*:c*. However, if tapes can be marked as belonging
to the domain or range of the transduction, the ambiguity will be
resolved.
Formally, an n-tape finite-state automaton is a 5-tuple
$M = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of
states, $\Sigma$ is a finite input alphabet (a set of n-tuples of
symbols), $\delta$ is a transition function mapping
$Q \times \Sigma^n$ to $Q$, $q_0 \in Q$ is an initial state, and
$F \subseteq Q$ is a set of final states. An n-tape FSA accepts an
n-tuple of strings if and only if, starting from the initial state
$q_0$, it can scan all the symbols on every tape $i$,
$1 \le i \le n$, and end up in a final state $q \in F$.
An n-tape finite-state transducer is a 6-tuple
$M = (Q, \Sigma, \delta, q_0, F, d)$, where $Q$, $\Sigma$, $\delta$,
$q_0$ and $F$ are as before and $d$, $1 \le d < n$, is the number of
domain tapes. The number of range tapes is simply $n - d$.

⁶ For regular relations, see (Kaplan and Kay, 1994).

Let $A = (Q_1, \Sigma_1, \delta_1, q_1, F_1, d_1)$ and
$B = (Q_2, \Sigma_2, \delta_2, q_2, F_2, d_2)$ be two multi-tape
transducers over $n_1$ and $n_2$ tapes, respectively. Further, let
$s_i$ denote the symbol on the ith tape. There is a composition of A
and B, denoted by C, if and only if

$$d_2 = n_1 - d_1$$

with

$$C = (Q_1 \times Q_2,\ \Sigma_1 \cup \Sigma_2,\ \delta,\ [q_1, q_2],\ F_1 \times F_2,\ d_1)$$

where for all $p_1 \in Q_1$ and $p_2 \in Q_2$,

$$\delta([p_1, p_2],\ s_1 : \cdots : s_{d_1} : s'_{d_2+1} : \cdots : s'_{n_2}) = [\delta_1(p_1,\ s_1 : \cdots : s_{d_1} : s_{d_1+1} : \cdots : s_{n_1}),\ \delta_2(p_2,\ s'_1 : \cdots : s'_{d_2} : s'_{d_2+1} : \cdots : s'_{n_2})]$$

if and only if

$$s_{d_1+1} = s'_1,\ \ldots,\ s_{n_1} = s'_{d_2}.$$

The resulting machine is a k-tape machine, where
$k = d_1 - d_2 + n_2$.
Implementational Note 
We found that it is best not to indicate d, the number of domain
tapes, in the data structure representing the automaton, but to have
it as an argument to the composition function. This enables the user
to change the value of d per operation if the need arises.
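
The definition and the implementational note can be illustrated together with a small sketch in which the number of domain tapes is passed to the composition function rather than stored in the machine. The representation is an assumption made for the example: machines are dictionaries, each arc is (source, label, target) with the label an n-tuple holding one symbol per tape, the arc set is non-empty, and epsilon symbols are not treated specially. The name `compose` is hypothetical.

```python
from itertools import product


def compose(a, b, d1, d2):
    """Compose multi-tape transducers a and b, where d1 and d2 are the
    numbers of domain tapes of a and b for this particular operation."""
    n1 = len(next(iter(a["arcs"]))[1])  # tapes of a (assumes non-empty arcs)
    if d2 != n1 - d1:
        raise ValueError("composition undefined: need d2 == n1 - d1")
    arcs = set()
    for (p1, s, q1), (p2, t, q2) in product(a["arcs"], b["arcs"]):
        # The range tapes of a must agree with the domain tapes of b.
        if s[d1:] == t[:d2]:
            arcs.add(((p1, p2), s[:d1] + t[d2:], (q1, q2)))
    return {
        "start": (a["start"], b["start"]),
        "finals": set(product(a["finals"], b["finals"])),
        "arcs": arcs,
    }


A = {"start": 0, "finals": {0}, "arcs": {(0, ("a", "b", "b"), 0)}}  # a*:b*:b*
B = {"start": 0, "finals": {0}, "arcs": {(0, ("b", "b", "c"), 0)}}  # b*:b*:c*

two_tape = compose(A, B, d1=1, d2=2)
four_tape = compose(A, B, d1=2, d2=1)
print(sorted(label for _, label, _ in two_tape["arcs"]))   # [('a', 'c')]
print(sorted(label for _, label, _ in four_tape["arcs"]))  # [('a', 'b', 'b', 'c')]
```

Choosing d per call reproduces both readings of the a*:b*:b* and b*:b*:c* example: with d1 = 1 and d2 = 2 the result has k = 1 - 2 + 3 = 2 tapes, and with d1 = 2 and d2 = 1 it has k = 2 - 1 + 3 = 4 tapes.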
4.2 A Mixed Grammar 
Now we illustrate the advantage of having a linearized lexical form
by developing a mixed grammar. We make use of two grammars for the
data presented above: G1 for templatic nonlinear problems and G2 for
linear issues. For the current data, our G1 would be similar to the
rules in Grammar 1.
G2 takes as input the output of G1, i.e., the linearized lexical
form such as Syriac */katab/, */waladakatab/, etc. Since R4-R7 in
Grammar 2 represent one and the same phonological phenomenon, viz.,
the deletion of a short vowel in an open syllable, they can be
combined into one rule:

   R8   * - a - CV   ⇔
        * - ε - *

where C is a consonant and V is a vowel.

Grammar 3: Grammar for Spirantization, case for [b] → [v]

   R9    V - b - *   ⇔
         * - v - *

   R10   V - ⟨c,b,ε⟩ - *   ⇔
         * -    v    - *

   R11   ⟨v,ε,V⟩ - ⟨c,b,ε⟩ - *   ⇔
            *    -    v    - *

where V is a vowel.

An identity rule (similar to R3) is also required. Applying R8 and
the identity rule to the input of G2 is illustrated below:

   Linearized Lex Form   | w | a | l | a | d | a | k | a | t | a | b |
   Rule                    3   3   3   8   3   3   3   8   3   3   3
   Surface               | w | a | l |   | d | a | k |   | t | a | b |

Recall that the rule applies right-to-left.
It might not be clear from this example how advantageous this
solution is. After all, only three rules were saved. However, note
that almost all of the rules in a real grammar do not belong to the
templatic morphology domain, but to the linear
phonological/orthographic domain. Consider the case of Syriac
spirantization mentioned above, viz.,

   [+ plosive] → [+ fricative] / V __

Each of the six Syriac plosives requires a set of rules of the form
in Grammar 3: R9 applies when the center and context belong to
prefixes and suffixes, R10 applies when the center belongs to the
stem and the context belongs to a prefix, and R11 applies when the
center and context belong to the stem. (Since Syriac stems invariably
end in consonants, there is no rule for the case when the center
belongs to a suffix and the context belongs to the stem.) To cover
all six plosives, 18 rules are required. If, however, the rules are
to apply to the linearized lexical form, each plosive requires only
one rule similar to R9 (a total of six rules).
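
The rule counts can be reproduced by generating the rule sets mechanically. The sketch below is illustrative only: the rules are built as plain strings in an ad hoc one-line spelling (lexical side, `<=>`, surface side), the helper names are hypothetical, and the spirant for each plosive is taken from the list given earlier in the paper.

```python
SPIRANT = {"b": "v", "g": "ɣ", "d": "ð", "k": "x", "p": "f", "t": "θ"}


def multitape_rules(plosive, spirant):
    """Three rules per plosive, following the shapes of R9, R10 and R11."""
    surface = f"* - {spirant} - *"
    return [
        f"V - {plosive} - *  <=>  {surface}",              # affix, like R9
        f"V - <c,{plosive},ε> - *  <=>  {surface}",        # prefix + stem, like R10
        f"<v,ε,V> - <c,{plosive},ε> - *  <=>  {surface}",  # within stem, like R11
    ]


def linearized_rules(plosive, spirant):
    """One rule per plosive when rules apply to the linearized form."""
    return [f"V - {plosive} - *  <=>  * - {spirant} - *"]


multi = [r for p, s in SPIRANT.items() for r in multitape_rules(p, s)]
linear = [r for p, s in SPIRANT.items() for r in linearized_rules(p, s)]
print(len(multi), len(linear))  # 18 6
```

The factor of three comes entirely from where the center and its context sit relative to the stem, not from the phonology itself.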
5 Conclusion
Using a linearized form provides a pragmatic solution to the
problems discussed above. While the templatic morphology issues are
resolved using a multi-tape grammar, the linear-in-nature
phonological/graphemic issues are dealt with using a two-tape grammar
as in any other Western language. As illustrated with the vowel
deletion rule above, this makes the task of the grammar writer easier
by far. In addition, the size of the intermediate automata is
substantially decreased in terms of space complexity.
There is another advantage of this model if used in a multilingual
Semitic environment. We noted above how the derivation of /katab/ in
Arabic and Syriac is similar. The only difference is that in the
latter a vowel deletion rule takes place. It is then possible to
generalize the lexical-to-linearized-form module for more than one
Semitic language.
At the abstract finite-state level, our solution may have some
similarities with the proposal of (Kornai, 1991), which aims at
modeling autosegmental phonology by coding nonlinear autosegmental
representations as linear strings. Kornai's approach linearizes the
lexical nonlinear representation from the outset using a number of
coding mechanisms.

References 
Goldsmith, J. 1976. Autosegmental Phonology. Ph.D. thesis, MIT. Published as Autosegmental and Metrical Phonology, Oxford 1990.
Grimley-Evans, E., G. Kiraz, and S. Pulman. 1996. Compiling a partition-based two-level formalism. In COLING-96: Papers Presented to the 16th International Conference on Computational Linguistics.
Kaplan, R. and M. Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331-78.
Kay, M. 1987. Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, pages 2-10.
Kiraz, G. 1996. SemHe: A generalised two-level system. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Kiraz, G. [Forthcoming]. Computational Approach to Nonlinear Morphology: with emphasis on Semitic languages. Cambridge University Press.
Kornai, A. 1991. Formal Phonology. Ph.D. thesis, Stanford University.
McCarthy, J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry, 12(3):373-418.
McCarthy, J. 1986. OCP effects: gemination and antigemination. Linguistic Inquiry, 17.
Pulman, S. and M. Hepple. 1993. A feature-based formalism for two-level phonology: a description and implementation. Computer Speech and Language, 7:333-58.
