Efficient Generation in Primitive Optimality Theory 
Jason Eisner 
Dept. of Computer and Information Science 
University of Pennsylvania 
200 S. 33rd St., Philadelphia, PA 19104-6389, USA 
j eisner@linc, cis. upenn, edu 
Abstract 
This paper introduces primitive Optimal- 
ity Theory (OTP), a linguistically moti- 
vated formalization of OT. OTP specifies 
the class of autosegmental representations, 
the universal generator Gen, and the two 
simple families of permissible constraints. 
In contrast to less restricted theories us- 
ing Generalized Alignment, OTP's opti- 
mal surface forms can be generated with 
finite-state methods adapted from (Ellison, 
1994). Unfortunately these methods take 
time exponential on the size of the gram- 
mar. Indeed the generation problem is 
shown NP-complete in this sense. How- 
ever, techniques are discussed for making 
Ellison's approach fast in the typical case, 
including a simple trick that alone provides 
a 100-fold speedup on a grammar fragment 
of moderate size. One avenue for future 
improvements is a new finite-state notion, 
"factored automata," where regular lan- 
guages are represented compactly via for- 
mal intersections N~=IAi of FSAs. 
1 Why formalize OT? 
Phonology has recently undergone a paradigm shift. 
Since the seminal work of (Prince & Smolensky, 
1993), phonologists have published literally hun- 
dreds of analyses in the new constraint-based frame- 
work of Optimality Th.eory, or OT. Old-style deriva- 
tional analyses have all but vanished from the lin- 
guistics conferences. 
The price of this creative ferment has been a cer- 
tain lack of rigor. The claim for O.T as Universal 
Grammar is not substantive or falsifiable without 
formal definitions of the putative Universal Gram- 
mar objects Repns, Con, and Gen (see below). 
Formalizing OT is necessary not only to flesh it out 
as a linguistic theory, but also for the sake of compu- 
tational phonology. Without knowing what classes 
of constraints may appear in grammars, we can say 
only so much about the properties of the system, 
or about algorithms for generation, comprehension, 
and learning. 
The central claim of OT is that the phonology of 
any language can be naturally described as succes- 
sive filtering. In OT, a phonological grammar for 
a language consists of ~ vector C1, C2, • .. C, of soft 
constraints drawn from a universal fixed set Con. 
Each constraint in the vector is a function that scores 
possible output representations (surface forms): 
(1) Ci : Repns --* {0, 1, 2,...} (Ci E Con) 
If C~(R) = 0, the output representation R is said to 
satisfy the ith constraint of the language. Other- 
wise it is said to violate that constraint, where the 
value of C~(R) specifies the degree of violation. Each 
constraint yields a filter that permits only minimal 
violation of the constraint: 
(2) Filteri(Set)= {R E Set : Ci(R) is minimal} 
Given an underlying phonological input, its set of 
legal surface forms under the grammar--typically of 
size 1--is just 
(3) Filter, (...Filter,. (Filter 1 (Gen(input)))) 
where the function Gen is fixed across languages 
and Gen(input) C_ Repns is a potentially infinite 
set of candidate surface forms. 
In practice, each surface form in Gen(input) must 
contain a silent copy of input, so the constraints 
can score it on how closely its pronounced material 
matches input. The constraints also score other cri- 
teria, such as how easy the material is to pronounce. 
If C1 in a given language is violated by just the forms 
with coda consonants, then Filterl(Gen(input)) in- 
cludes only coda-free candidates--regardless of their 
other demerits, such as discrepancies from input 
or unusual syllable structure. The remaining con- 
straints are satisfied only as well as they can be given 
this set of survivors. Thus, when it is impossible 
to satisfy all constraints at once, successive filtering 
means early constraints take priority. 
Questions under the new paradigm include these: 
• Generation. How to implement the input- 
output mapping in (3)? A brute-force approach 
313 
fails to terminate if Gen produces infinitely 
many candidates. Speakers must solve this 
problem. So must linguists, if they are to know 
what their proposed grammars predict. 
• Comprehension. How to invert the input- 
output mapping in (3)? Hearers must solve this. 
• Learn,ng. How to induce a lexicon and a 
phonology like (1) for a particular language. 
given the kind of evidence available to child lan- 
guage learners? 
None of these questions is well-posed without restric- 
tions on Gen and Con. 
In the absence of such restrictions, computational 
linguists have assumed convenient ones. gllison 
(1994) solves the generation problem where Gen 
produces a regular set of strings and Con admits 
all finite state transducers that can map a string to 
a number in unary notation. (Thus Ci(R) = 4 if the 
Ci transducer outputs the string llll on input R.) 
Tesar (1995. 1996) extends this result to the case 
where Gen(mput) is the set of parse trees for input 
under some context-free grammar (CFG)3 Tesar's 
constraints are functions on parse trees such tha~ 
Ci(\[A \[B1.. \] \[B~.-- .\]\]) can be computed from A, B:, 
B2, Ci(B1), and Ci(B~.). The optimal tree can then 
be found with a standard dynamic-programming 
chart parser for weighted CFGs. 
It is an important question whether these for- 
malisms are useful in practice. On the one hand, are 
they expressive enough to describe real languages? 
On the other, are they restrictive enough to admit 
good comprehension and unsupervised-learning al- 
gorithms? 
The present paper sketches primitive Optimal- 
ity Theory (OTP)--a new formalization of OT 
that is explicitly proposed as a linguistic hypothe- 
sis. Representations are autosegmental, Gen is triv- 
ial, and only certain simple and phonologically local 
constraints are allowed. I then show the following: 
i. Good news: Generation in OTP can be solved 
attractively with finke-state methods. The so- 
lution is given in some detail. 
2. Good news: OTP usefully restricts the space of 
grammars to be learned. (In particular. Gener- 
alized Alignment is outside the scope of finite- 
state or indeed context-free methods.} 
3. Bad news: While OTP generation is close to lin- 
ear on the size of the input form. it is NP-hard 
on the size of the grammar, which for human 
languages is likely to be quite large. 
4. Good yews: Ellison's algorithm can be improved 
so that its exponential blowup is often avoided. 
*This extension is useful for OT syntax but may have 
little application to phonology, since the context-free 
case reduces to the regular case (i.e., Ellison) unless the 
CFG contains recursive productions. 
2 Primitive Optimality Theory 
Primitive Optimality Theory. or OTP. is a formal- 
ization of OT featuring a homogeneous output repre- 
sentation, extremely' local constraints, and a simple, 
unrestricted Gen. Linguistic arguments t'or OTP's 
constraints and representations are given in !Eisner. 
1997). whereas the present description focuses ,an its 
formal properties and suitability for computational 
work. An axiomatic treatment is omitted for rea- 
sons of space. Despite its simplicity. OTP appears 
capable of capturing virtually all analyses found in 
the (phonological) OT literature. 
2.1 Repns: Representations in OTP 
To represent imP\], OTP uses not the autosegmentai 
representation in (4a) IGoldsmith. 1976: Goldsmith. 
1990) but rather the simplified autosegmental rep- 
resentation in (4b), which has no association lines. 
Similarly (Sa) is replaced by (Sb). The central rep- 
resentational notion is that of a constituent time- 
line: an infinitely divisible line along on which con- 
stituents are laid out. Every constituent has width 
and edges. 
(4) a. voi b. ,~o,\[ • J t,o~ 
haS/ n~Jt \]ha, 
1/ c! cIc \]c ! 
C C lab\[ jlab 
lab 
For phonetic interpretation: \]~o, says to end voic- 
ing (laryngeal vibration). At the same instant, 
\],,~, says to end nasality (raise velum}. 
(5) a. 
O" O" 
/1\ /I 
CVCV 
b. ~\[ 
C\[ ~ " : C ,-" 
~a 
\]¢-. j- 
V i .V 
.k timeline can carry tl~e full panoply of phonolog- 
ical and morphological ,:onstituents--an.vthing that 
phonological constraints might have to refer to. 
Thus, a timetine bears not only autosegmental fe,.'> 
tures like nasal gestures inasi and prosodic ,:on- 
stituents such as syllables \[o'\]. but also stress marks 
\[x\], feature dpmains such as \[ATRdom\] (Cole L: 
Kisseberth, 1994) and morphemes such as \[Stem i. 
All these constituents are formally identicah each 
marks off an interval on the timeline. Let Tiers de- 
note the fixed finite set of constituent types. {has. 
~. x, ATRdom. S*.em .... }. 
It is always possible to recover the old representa- 
tion (4a) from the new one (4b), under the conven- 
tion that two constituents on the timeline are linked 
if their interiors overlap (Bird & Ellison, 1994). The 
interior of a constituent is the open interval that 
314 
excludes its edges: Thus, lab is linked to both con- 
sonants C in (4b), but the two consonants are not 
linked to each other, because their interiors do not 
overlap. 
By eliminating explicit association lines, OTP 
eliminates the need for faithfulness constraints on 
them, or for well-formedness constraints against gap- 
ping or crossing of associations. In addition, OTP 
can refer naturally to the edges of syllables (or mor- 
phemes). Such edges are tricky to define in (5a), be- 
cause a syllable's features are scattered across multi- 
ple tiers and perhaps shared with adjacent syllables. 
In diagrams of timelines, such as (4b) and (5b), 
the intent is that only horizontal order matters. 
Horizontal spacing and vertical order are irrelevant. 
Thus, a timeline may be represented as a finite col- 
lection S of labeled edge brackets, equipped with or- 
dering relations -~ and " that indicate which brack- 
ets precede each other or fall in the same place. 
Valid timelines (those in Repns) also require that 
edge brackets come in matching pairs, that con- 
stituents have positive width, and that constituents 
of the same type do not overlap (i.e., two con- 
stituents on the same tier may not be linked). 
2.2 Gem Input and output in OTP 
OT's principle of Containment (Prince & Smolen- 
sky, 1993) says that each of the potential outputs in 
Repns includes a silent copy of the input, so that 
constraints evaluating it can consider the goodness 
of match between input and output. Accordingly, 
OTP represents both input and output constituents 
on the constituent timeline, but on different tiers. 
Thus surface nasal autosegments are bracketed with 
,~as\[ and \],,a~, while underlying nasal autosegments 
are bracketed with ,as\[ and \] .... The underlining 
is a notational convention to denote input material. 
No connection is required between \[nas\] and \[nas! 
except as enforced by constraints that prefer \[nas\] 
and \[nas\] or their edges to overlap in some way. (6) 
shows a candidate in which underlying \[nas\] has sur- 
faced "in place" but with rightward spreading. 
(6) ~o,\[ \]~o~ 
.o,\[ \].o, 
Here the left edges and interiors overlap, but the 
right edges fail to. Such overlap of interiors may be 
regarded as featural Input-Output Correspondence 
in the sense of (McCarthy & Prince, 1995). 
The lexicon and morphology supply to Gen an 
underspecified timeline--a partially ordered col- 
lection of input edges. The use of a partial ordering 
allows the lexicon and morphology to supply float- 
ing tones, floating morphemes and templatic mor- 
phemes. 
Given such an underspecified timeline as lexical 
input, Gen outputs the set of all fully specified time- 
lines that are consistent with it. No new input con- 
stituents may be added. In essence, Gen generates 
every way of refining the partial order of input con- 
stituents into a total order and decorating it freely 
with output constituents. Conditions such as the 
prosodic hierarchy (Selkirk, 1980) are enforced by 
universally high-ranked constraints, not by Gen. -~ 
2.3 Con: The primitive constraints 
Having described the representations used, it is now 
possible to describe the constraints that evaluate 
them. OTP claims that Con is restricted to the 
following two families of primitive constraints: 
(7) a --* /3 ("implication"): 
"Each ~ temporally overlaps some ~." 
Scoring: Constraint(R) = number of a's in R 
that do not overlap any 8. 
(8) a 3- /3 ("clash"): 
"Each cr temporally overlaps no/3." 
Scoring: Constraint(R) = number of (a, ';3) 
pairs in R such that the a overlaps the/3. 
That is, a --~ /3 says that a's attract /3's, while 
a 3_ /3 says that c~'s repel/3's. These are simple and 
arguably natural constraints; no others are used. 
In each primitive constraint, cr and /3 each spec- 
ify a phonological event. An event is defined to be 
either a type of labeled edge, written e.g. ~\[, or 
the interior (excluding edges) of a type of labeled 
constituent, written e.g.a. To express some con- 
straints that appear in real phonologies, it is also 
necessary to allow, a and /3 to be non-empty con- 
junctions and disjunctions of events. However, it 
appears possible to limit these cases to the forms in 
(9)-(10). Note that other forms, such as those in 
(11), can be decomposed into a sequence of two or 
~The formalism is complicated slightly by the pos- 
sibility of deleting segments (syncope) or inserting seg- 
ments (epenthesis), as illustrated by the candidates be- 
low. 
(i) Syncope (CVC ~ CC): the _V is crushed to zero 
width so the C's can be adjacent. 
c\[ Ic \]c ~\[ 1~_ \]~ 
vlv 
(ii) Epenthesis (CC ~ CVC): the C__'s are pushed 
apart. 
c\[ \]~ ~\[ \]~ ~_\[ \]~_ ~\[ \]~ 
In order to Mlow adjacency of the surface consonants in 
(i), as expected by assimilation processes (and encour- 
aged by a high-ranked constraint), note that the underly- 
ing vowel must be allowed to have zero width--an option 
available to to input but not output constituents. The 
input representation must specify only v\[ "< Iv, not 
v\[ ~ \]v. Similarly, to allow (ii), the input representa- 
tion must specify only \]c, __. c_~\[, not \]o, ~ c2\[. 
315 
more constraints. 3 
(9) ( c~1 and a~ and ... ) ---* (/31 or/32 or ...) 
Scoring: Constraint(R) = number of sets of 
events {A1, A2,...} of types (~l, a, .... respec- 
tively that all overlap on the timeline and 
whose intersection does not overlap any event 
of type/31,/3.,, • ... 
(10) (al anda2 and ...) .L (/31 and/3~ and ...) 
Scoring: Constraint(R) = number of sets 
of events {A1,A~.,..., B1,B~ .... } of types 
oq,a~ .... ,/31,/32,... respectively that all 
overlap on the timeline. 
(Could a/so be notated: 
al ± a2 ± "" ± Zl ± /~2 ± "".) 
(11) ¢X ~ ( fll and /32 ) \[cf. o~ ~ /31 >> c~ --~ /32\] 
( cq or ~.~ ) --* ,3 \[cf. ~1 ---* /3 >> a.~ --~ /3\] 
The unifying theme is that each primitive con- 
straint counts the number of times a candidate gets 
into some bad local configuration. This is an inter- 
val on the timeline throughout which certain events 
(one or more specified edges or interiors) are all 
present and certain other events (zero or more spec- 
ified edges or interiors) are all absent. 
Several examples of phonologically plausible con- 
straints, with monikers and descriptions, are given 
below. (Eisner, 1997) shows how to rewrite hun- 
dreds of constraints from the literature in the primi- 
tive constraint notation, and discusses the problem- 
atic case of reduplication. (Eisner, in press) gives 
a detailed stress typology using only primitive con- 
straints; in particular, non-local constraints such 
as FTBIN, FOOTFORM, and Generalized Alignment 
(McCarthy & Prince, 1993) are eliminated. 
(12) a. ONSET: a\[- C\[ 
"Every syllable starts with a consonant." 
b. NONFINALITY: \]Wo,-d _1_ \]F 
"The end of a word may not be footed." 
c o\[ , 
l"eet start and end on syllable boundaries." 
d. PACKFEET: \]F ""+ F\[ 
"Each foot is followed immediately by an- 
other foot; i.e., minimize the number of gaps 
between feet. Note that the final foot, if any, 
will always violate this constraint." 
e, NOCLASH: \]X A_ x\[ 
"Two stress marks may not be adjacent." 
f. PROGRESSIVEVOICING: \]voi _1_ C\[ 
"If the segment preceding a consonant is 
voiced, voicing may not stop prior to the 
3Such a sequence does alter the meaning slightly. To 
get the exact original meaning, we would have to de- 
compose into so-cMled "unranked" constraints, whereby 
Ci (R) is defined as C,, (R)+Ci~ (R). But such ties under- 
mine OT's idea of strict ranking: they confer the power 
to minimize linear functions such as (C1 + C1 + C1 + 
C2 + C3 + C3)(R) = 3C1 (R) + C2(R) + 2C3(R). For this 
reason, OTP currently disallows unranked constraints; I 
know of no linguistic data that crucially require them. 
consonant but must be spread onto it." 
g, NASVOI: nas -- voi 
"Every nasal gesture must be at least partly 
voiced." 
h. FULLNASVOI: has _\[_ voi\[, has I \]voi 
"A nasal gesture may not be only partly 
voiced." 
i. MAX(VOi) or PARSE(voi): vo._i ~ voi 
"Underlying voicing features surface." 
j. DEP(voi) or FILL(voi): voi ---, voi 
"Voicing features appear on the surface only 
if they are a/so underlying." 
k. NoSPREADRIGHT(voi): voi _1_ \]vo__i_ 
"Underlying voicing may not spread right- 
ward as in (6)." 
h NONDEGENERATE: F --~\[ 
"Every foot must cross at least one morn 
boundary ,\[." 
m. TAUTOMORPHEMICFOOT: F _\]_ .~Iorph\[ 
"No foot may cross a morpheme boundary." 
3 Finite-state generation in OTP 
3.1 A simple generation algorithm 
Recall that the generation problem is to find the 
output set S,~, where 
(13) a. So = Gen(inpu~) C_ Repns 
b. Si+l = Filteri+l(Si) C Si 
Since in OTP, the input is a partial order of edge 
brackets, and Sn is a set of one or more total orders 
(timelines), a natural approach is to successively re- 
fine a partial order. This has merit. However, not 
every Si can be represented as a single partial order, 
so the approach is quickly complicated by the need 
to encode disjunction. 
A simpler approach is to represent Si (as well 
as inpu~ and Repns) as a finite-state automaton 
(FSA), denoting a regular set of strings that encode 
timelines. The idea is essentially due to (Ellison, 
1994), and can be boiled down to two lines: 
(14) Ellison's algorithm (variant). 
So = input N Repns 
= all conceivable outputs containing input 
Si+l = BestP~tths(Si N Ci+l) 
Each constraint Ci must be formulated as an edge- 
weighted FSA that scores candidates: Ci accepts any 
string R, on a singl e path of weight Ci(R). 4 Best- 
Paths is Dijkstra's "single-source shortest paths" 
algorithm, a dynamic-programming algorithm that 
prunes away all but the minimum-weight paths in 
an automaton, leaving an unweighted automaton. 
OTP is simple enough that it can be described in 
this way. The next section gives a nice encoding. 
4Weighted versions of the state-labeled finite au- 
tomata of (Bird & EUison, 1994) could be used instead. 
316 
3.2 OTP with automata 
We may encode each timeline as a string over an 
enormous alphabet E. If \[Tiersl = k, then each 
symbol in E is a k-tuple, whose components describe 
what is happening on the various tiers at a given 
moment. The components are drawn from a smaller 
alphabet A = { \[, \], l, +, -}. Thus at any time, the 
ith tier may be beginning or ending a constituent ( \[, 
\] ) or both at once ( I ), or it may be in a steady state 
in the interior or exterior of a constituent (+, -). 
At a minimum, the string must record all moments 
where there is an edge on some tier. If all tiers are in 
a steady state, the string need not use any symbols 
to say so. Thus the string encoding is not unique. 
(15) gives an expression for all strings that cor- 
rectly describe the single tier shown. (16) describes 
a two-tier timeline consistent with (15). Note that 
the brackets on the two tiers are ordered with re- 
spect to each other. Timelines like these could be 
assembled morphologically from one or more lexical 
entries (Bird & Ellison, 1994), or produced in the 
course of algorithm (14). 
(15) \]= -*\[+*1+'3-* 
(16) 
(-,->*<\[:,-)<,,->*<+, r><+, +)*(I, +)<+, +)* (+, \])(+,-)*(*, \[)(*, +)*C\], 1) 
We store timeline expressions like (16) as deter- 
ministic FSAs. To reduce the size of these automata, 
it is convenient to label arcs not with individual el- 
ements of El (which is huge) but with subsets of E, 
denoted by predicates. We use conjunctive predi- 
cates where each conjunct lists the allowed symbols 
on a given tier: 
(17) +F, 3cr, \[l+-voi (arc label w/ 3 conjuncts) 
The arc label in (17) is said to mention the tiers 
F, o', voi E Tiers. Such a predicate allows any sym- 
bol from A on the tiers it does not mention. 
The input FSA constrains only the input tiers. In 
(14) we intersect it with Repns, which constrains 
only the output tiers. Repns is defined as the inter- 
section of many automata exactly like (18), called 
tier rules, which ensure that brackets are properly 
paired on a given tier such as F (foot). 
(18) -F ,+F 
Like the tier rules, the constraint automata Ci are 
small and deterministic and can be built automat- 
ically. Every edge has weight O or 1. With some 
care it is possible to draw each Ci with two or fewer 
states, and with a number of arcs proportional to 
the number of tiers mentioned by the constraint. 
Keeping the constraints small is important for ef- 
ficiency, since real languages have many constraints 
that must be intersected. 
Let us do the hardest case first. An implication 
constraint has the general form (9). Suppose that all 
the c~i are interiors, not edges. Then the constraint 
targets intervals of the form a = c~1 f'l c~2 fq • • .. Each 
time such an interval ends without any 3j having 
occurred during it, one violation is counted: 
(19) Weight-1 arcs are shown in bold; others are 
weight-0. (other) 
(other) 
b during a ~/~ 
"-1/ II a ends 
A candidate that does see a #j during an c~ can go 
and rest in the right-hand state for the duration of 
the a. 
Let us fill in the details of (19). How do we detect 
the end of an a? Because one or more of the ai end 
(\], I), while all the al either end or continue (+), so 
that we know we are leaving an a. 5 Thus: 
(20) (in all ai)- (some bj) 
in all ai 
An unusually complex example is shown in (21). 
Note that to preserve the form of the predicates 
in (17) and keep the automaton deterministic, we 
need to split some of the arcs above into multi- 
ple arcs. Each flj gets its own arc, and we must 
also expand set differences into multiple arcs, using 
the scheme W - z A y A z = W V ~(x A y A z) = 
(W A ~x) V (W A z A-~y) V (W A x A y A -~:). 
sit is important to take \], not +, as our indication that 
we have been inside • constituent. This means that the 
timeline ( \[, -)(+, -)*(+, \[)(% +)*('\], +)(-, +)*(-, \]) cannot 
avoid violating a clash constraint simply by instantiat- 
ing the (+, +)* part as e. Furthermore, the \] convention 
means that a zero-width input constituent (more pre- 
cisely, a sequence of zero-width constituents, represented 
as a single 1 symbol) will often act as if it has an interior. 
Thus if V syncopates as in footnote 2, it still violates the 
parse constraint _V -- V. This is an explicit property of 
OTP: otherwise, nothing that failed to parse would ever 
violate PARSE, because it would be gone! 
On the other hand, "l does not have this special role 
on the right hand side of ---+ , which does not quantify 
universally over an interval. The consequence for zero- 
width consituents is that even if a zero-width 1/_" overlaps 
(at the edge, say) with a surface V, the latter cannot 
claim on this basis alone to satisfy FILL: V -- V__. This 
too seems like the right move linguistically, although fur- 
ther study is needed. 
317 
(21) (pandq) --* (borc\[) 
+p +q \[\]l-b \]+-c 
How about other cases? If the antecedent of 
an implication is not. an interval, then the con- 
straint needs only one state, to penalize mo- 
ments when the antecedent holds and the con- 
sequent does not. Finally, a clash constraint 
cq I a2 _1_ ... is identical to the implication 
constraint (or1 and a.~ and...) --* FALSE. Clash 
FSAs are therefore just degenerate versions of im- 
plication FSAs, where the arcs looking for/3j do not 
exist because they would accept no symbol. (22) 
shows the constraints (p and \]q ) --+ b and p 3_ q. 
(22) +p +q 
4 Computational requirements 
4.1 Generalized Alignment is not flnite-state 
Ellison's method can succeed only on a restricted 
formalism such as OTP, which does not admit such 
constraints as the popular Generalized Alignment 
(GA) family of (McCarthy & Prince, 1993). A typ- 
ical GA constraint is ALIGN(F, L, Word, L), which 
sums the number of syllables between each left foot 
edge F\[ and the left edge of the prosodic word. Min- 
imizing this sum achieves a kind of left-to-right it- 
erative footing. OTP argues that such non-local, 
arithmetic constraints can generally be eliminated 
in favor of simpler mechanisms (Eisner, in press). 
Ellison's method cannot directly express the above 
GA constraint, even outside OTP, because it cannot 
compute a quadratic function 0 + 2 + 4 + -.. on a 
string like \[~cr\]F \[~a\]r \[~\]r '" '. Path weights in an 
FSA cannot be more than linear on string length. 
Perhaps the filtering operation of any GA con- 
straint can be simulated with a system of finite- 
state constraints? No: GA is simply too powerful. 
The proof is suppressed here for reasons of space, 
but it relies on a form of the pumping lemma for 
weighted FSAs. The key insight is that among can- 
didates with a fixed number of syllables and a single 
(floating) tone, ALIGN(a, L, H, L) prefers candidates 
where the tone docks at the center. A similar argu- 
ment for weighted CFGs (using two tones) shows this 
constraint to be too hard even for (Tesar, 1996). 
4.2 Generation is NP-complete even in OTP 
When algorithm (14) is implemented literally and 
with moderate care, using an optimizing C compiler 
on a 167MHz UltraSPARC, it takes fully 3.5 minutes 
(real time) to discover a stress pattern for the syl- 
lable sequence ~.6 The automata 
become impractically huge due to intersections. 
Much of the explosion in this case is introduced 
at the start and can be avoided. Because Repns 
has 21Tiersl = 512 states, So, $1, and $2 each 
have about 5000 states and 500,000 to 775,000 arcs. 
Thereafter the S~ automata become smaller, thanks 
to the pruning performed at each step by BestPaths. 
This repeated pruning is already an improvement 
over Ellison's original algorithm (which saves prun- 
ing till the end, and so continues to grow exponen- 
tially with every new constraint). If we modify (14) 
further, so that each tier rule from Repns is inter- 
sected with the candidate set only when its tier is 
first mentioned by a constraint, then the automata 
are pruned back as quickly as they grow. They have 
about 10 times fewer states and 100 times fewer arcs. 
and the generation time drops to 2.2 seconds. 
This is a key practical trick. But neither it nor 
any other trick can help for all grammars, for in the 
worst case, the OTP generation problem is NP-hard 
on the number of tiers used by the grammar. The 
locality of constraints does not save us here. Many 
NP-complete problems, such as graph coloring or 
bin packing, attempt to minimize some global count 
subject to numerous local restrictions. In the case of 
OTP generation, the global count to minimize is the 
degree of violation of Ci, and the local restrictions 
are imposed by C1, C2,... Ci-1. 
Proof of NP-hardness (by polytime reduction 
from Hamilton Path). Given G = (V(G), E(G)), 
an n-vertex directed graph. Put Tiers = V(G)tO 
{Stem, S}. Consider the following vector of O(n -~) 
primitive constraints (ordered as shown): 
(23) a. VveV(a): ~\[-~s\[ 
b. Vv E V(G): \]~ -- \]s 
c. Vv e V(G): St-era -~ v 
d. Stem .1_ S 
e. Vu, ve V(G) s.t. uv ~ E(G): \]u .L o\[ 
f. Is - 
SThe grammar is taken from the OTP stress typol- 
ogy proposed by (Eisner, in press). It has tier rules for 9 
tiers, and then spends 26 constraints on obvious univer- 
sal properties of morns and syllables, followed by 6 con- 
straints for universal properties of feet and stress marks 
and finally 6 substantive constraints that can be freely 
reranked to yield different stress systems, such as left-to- 
right iambs with iambic lengthening. 
318 
Suppose the input is simply \[Stem\]. Filtering 
Gen(input) through constraints (23a-d), we are left 
with just those candidates where Stem bears n (dis- 
joint) constituents of type S, each coextensive with 
a constituent bearing a different label v E V(G). 
(These candidates satisfy (23a-c) but violate (23d) 
n times.) (23e) says that a chain of abutting con- 
stituents \[uIvIw\]. • • is allowed only if it corresponds 
to a path in G. Finally, (23f) forces the grammar to 
minimize the number of such chains. If the minimum 
is 1 (i.e., an arbitrarily selected output candidate vi- 
olates (23f) only once), then G has a Hamilton path. 
When confronted with this pathological case, the 
finite:state methods respond essentially by enumer- 
ating all possible permutations of V(G) (though 
with sharing of prefixes). The machine state stores, 
among other things, the subset of V(G) that has al- 
ready been seen; so there are at least 2 ITiersl states. 
It must be emphasized that if the grammar is 
fixed in advance, algorithm (14) is close to linear 
in the size of the input form: it is dominated by 
a constant number of calls to Dijkstra's BestPaths 
method, each taking time O(\[input arcs\[ log \[input 
statesl). There are nonetheless three reasons why 
the above result is important. (a) It raises the prac- 
tical specter of huge constant factors (> 2 4°) for real 
grammars. Even if a fixed grammar can somehow be 
compiled into a fast form for use with many inputs, 
the compilation itself will have to deal with this con- 
stant factor. (b) The result has the interesting im- 
plication that candidate sets can arise that cannot 
be concisely represented with FSAs. For if all Si 
were polynomial-sized in (14), the algorithm would 
run in polynomial time. (c) Finally, the grammar 
is not fixed in all circumstances: both linguists and 
children crucially experiment with different theories. 
4.3 Work in progress: Factored automata 
The previous section gave a useful trick for speeding 
up Ellison's algorithm in the typical case. We are 
currently experimenting with additional improve- 
ments along the same lines, which attempt to de- 
fer intersection by keeping tiers separate as long as 
possible. 
The idea is to represent the candidate set S/not as 
a large unweighted FSA, but rather as a collection A 
of preferably small unweighted FSAs, called factors, 
each of which mentions as few tiers as possible. This 
collection, called a factored automaton, serves as 
a compact representation of hA. It usually has far 
fewer states than 71.,4 would if the intersection were 
carried out. 
For instance, the natural factors of So are input 
and all the tier rules (see 18). This requires only 
O(\[Tiers\[ + \[input\[) states, not O(21Tiersl. \[input\[). 
Using factored automata helps Ellison's algorithm 
(14) in several ways: 
• The candidate sets Si tend to be represented 
more compactly. 
• In (14), the constraint Ci+l needs to be inter- 
sected with only certain factors of Si. 
• Sometimes Ci+l does not need to be intersected 
with the input, because they do not mention 
any of the same tiers. Then step i + 1 can be 
performed in time independent of input length. 
Example: input = , which is 
a 43-state automaton, and C1 is F -- x, which says 
that every foot bears a stress mark. Then to find 
$1 = BestPaths(S0 71 C1), we need only consider 
S0's tier rules for F and x, which require well-formed 
feet and well-formed stress marks, and combine them 
with C1 to get a new factor that requires stressed 
feet. No other factors need be involved. 
The key operation in (14) is to find Bestpaths(A 71 
C), where .4 is an unweighted factored automaton 
and C is an ordinary weighted FSA (a constraint). 
This is the best intersection problem. For con- 
creteness let us suppose that C encodes F ---* x, a 
two-state constraint. 
A naive idea is simply to add F ---* x to ..4 as 
a new factor. However, this ignores the BestPaths 
step: we wish to keep just the best paths in r\[ ~ x\[ 
that are compatible with A. Such paths might be 
long and include cycles in F\[ ---* x\[. For example, 
a weight-1 path would describe a chain of optimal 
stressed feet interrupted by a single unstressed one 
where A happens to block stress. 
A corrected variant is to put I -- 71.A and run 
BestPaths on I 71 C. Let the pruned result be B. 
We could add B directly back to to ,4 as a new 
factor, but it is large. We would rather add a smaller 
factor B' that has the same effect, in that 1 71 B' = 
1 71 B. (B' will look something like the original C, 
but with some paths missing, some states split, and 
some cycles unrolled.) Observe that each state of B 
has the form i x c for some i E I and c E C. We 
form B' from B by "re-merging" states i x c and 
i' x c where possible, using an approach similar to 
DFA minimization. 
Of course, this variant is not very efficient, because 
it requires us to find and use I = N.4. What we 
really want is to follow the above idea but use a 
smaller I, one that considers just the relevant factors 
in .,4. We need not consider factors that will not 
affect the choice of paths in C above. 
Various approaches are possible for choosing such 
an I. The following technique is completely general, 
though it may or may not be practical. 
Observe that for BestPaths to do the correct 
thing, I needs to reflect the sum total of .A's con- 
straints on F and x, the tiers that C mentions. More 
formally, we want I to be the projection of the can- 
didate set N.A onto just the F and x tiers. Unfortu- 
nately, these constraints are not just reflected in the 
factors mentioning F or x, since the allowed con- 
figurations of F and x may be mediated through 
319 
additional factors. As an example, there may be a 
factor mentioning F and ¢, some of whose paths are 
incompatible with the input factor, because the lat- 
ter allows ¢ only in certain places or because only 
allows paths of length 14. 
1. Number the tiers such that F and x are num- 
bered 0, and all other tiers have distinct positive 
numbers. 
2. Partition the factors of .4 into lists L0, L1, 
L2,... Lk, according to the highest-numbered 
tier they mention. (Any factor that mentions 
no tiers at all goes onto L0.) 
3. If k -- 0, then return MLk as our desired I. 
4. Otherwise, MLk exhausts tier k's ability to me- 
diate relations among the factors. Modify the 
arc labels of ML} so that they no longer restrict 
(mention) k. Then add a determinized, mini- 
mized version of the result to to Lj, where j is 
the highest-numbered tier it now mentions. 
5. Decrement k and return to step 3. 
If n.4 has k factors, this technique must per- 
form k - 1 intersections, just as if we had put 
I = n.4. However, it intersperses the intersections 
with determinization and minimization operations, 
so that the automata being intersected tend not 
to be large. In the best case, we will have k- 
1 intersection-determinization-minimizations that 
cost O(1) apiece, rather than k-1 intersections that 
cost up to 0(2 k) apiece. 
5 Conclusions 
Primitive Optimality Theory, or OTP, is an attempt 
to produce a a simple, rigorous, constraint-based 
model of phonology that is closely fitted to the needs 
of working linguists. I believe it is worth study both 
as a hypothesis about Universal Grammar and as a 
formal object. 
The present paper introduces the OTP formal- 
ization to the computational linguistics community. 
We have seen two formal results of interest, both 
having to do with generation of surface forms: 
• OTP's generative power is low: finite-state 
optimization. In particular it is more con- 
strained than theories using Generalized Align- 
ment. This is good news for comprehension and 
learning. 
• OTP's computational complexity, for genera- 
tion, is nonetheless high: NP-complete on the 
size of the grammar. This is mildly unfortunate 
for OTP and for the OT approach in general. 
It remains true that for a fixed grammar, the 
time to do generation is close to linear on the 
size of the input (Ellison, 1994), which is heart- 
ening if we intend to optimize long utterances 
with respect to a fixed phonology. 
Finally, we have considered the prospect of building 
a practical tool to generate optimal outputs from 
OT theories. We saw above to set up the represen- 
tations and constraints efficiently using determinis- 
tic finite-state automata, and how to remedy some 
hidden inefficiencies in the seminal work of (Elli- 
son, 1994), achieving at least a 100-fold observed 
speedup. Delayed intersection and aggressive prun- 
ing prove to be important. Aggressive minimization 
and a more compact. "factored" representation of 
automata may also turn out to help. 

References 
Bird, Steven, &: T. Mark Ellison. One Level Phonol- 
ogy: Autosegmental representations and rules as 
finite automata. Comp. Linguistics 20:55-90. 
Cole, Jennifer, ~z Charles Kisseberth. 1994. An op- 
timal domains theory of harmony. Studies in the 
Linguistic Sciences 24: 2. 
Eisner, Jason. In press. Decomposing FootForm: 
Primitive constraints in OT. Proceedings of SCIL 
8, NYU. Published by MIT Working Papers. 
(Available at http://ruccs.rutgers.edu/roa.html.) 
Eisner, Jason. What constraints should OT allow? 
Handout for talk at LSA, Chicago. (Available at 
http://ruccs.rutgers.edu/roa.html.) 
Ellison, T. Mark. Phonological derivation in opti- 
mality theory. COLING '94, 100%1013. 
Goldsmith, John. 1976. Autosegmental phonology. 
Cambridge, Mass: MIT PhD. dissertation. Pub- 
lished 1979 by New York: Garland Press. 
Goldsmith, John. i990. Autosegmental and metrical 
phonology. Oxford: Blackwell Publishers. 
McCarthy, John, & Alan Prince. 1993. General- 
ized alignment. Yearbook of Morphology, ed. Geert 
Booij & 3aap van Marle, pp. 79-153. Kluwer. 
McCarthy, John and Alan Prince. 1995. Faithful- 
ness and reduplicative identity. In Jill Beckman 
et al., eds., Papers in Optimality Theory. UMass. 
Amherst: GLSA. 259-384. 
Prince, Alan, & Paul Smolensky. 1993. Optimality 
theory: constrainl interaction in generative gram- 
mar. Technical Reports of the Rutgers University 
Center for Cognitive Science. 
Selkirk, Elizabeth. 1980. Prosodic domains in 
phonology: Sanskrit revisited. In Mark Aranoff 
and Mary-Louise Kean, eds., Juncture, pp. 107- 
129. Anna Libri, Saratoga, CA. 
Tesar, Bruce. 1995. Computational Optimality The- 
ory. Ph.D. dissertation, U. of Colorado, Boulder. 
Tesar, Bruce. 1996. Computing optimal descriptions 
for Optimality Theory: Grammars with context- 
free position structures. Proceedings of the 34th 
Annual Meeting of the ACL. 
