Non-deterministic Recursive Ascent Parsing 
Ren~ Leermakers 
Philips Research Laboratories, 
P.O. Box 80.000, 5600 JA Eindhoven, The Netherlands 
E-mail:leermake@rosetta.prl.philips.nl 
ABSTRACT 
A purely functional implementation of LR-parsers is 
given, together with a simple correctness proof. It is 
presented as a generalization of the recursive descent 
parser. For non-LR grammars the time-complexity of 
our parser is cubic if the functions that constitute the 
parser are implemented as memo-functions, i.e. func- 
tions that memorize the results of previous invocations. 
Memo-functions also facilitate a simple way to construct 
a very compact representation of the parse forest. For 
LR(0) grammars, our algorithm is closely related to the 
recursive ascent parsers recently discovered by Kruse- 
man Aretz \[1\] and Roberts \[2\]. Extended CF grammars 
(grammars with regular expressions at the right hand 
side) can be parsed with a simple modification of the 
LR-parser for normal CF grammars. 
1 Introduction 
In this paper we give a purely functional implementa- 
tion of LR-parsers, applicable to general CF grammars. 
It will be obtained as a generalization of the well-known 
recursive descent parsing technique. For LR(0) gram- 
mars, our result implies a deterministic parser that is 
closely related to the recursive ascent parsers discovered 
by Kruseman Aretz \[1\] and Roberts \[2\]. In the gen- 
eral non-deterministic case, the parser has cubic time 
complexity if the parse functions are implemented as 
memo-functions \[3\], which are functions that memorize 
and re-use the results of previous invocations. Memo- 
functions are easily implemented in most programming 
languages. The notion of memo-functions is also used 
to define an algorithm that constructs a cubic represen- 
tation for the parse forest, i.e. the collection of parse 
trees. 
It has been claimed by Tomita that non-deterministic 
LR-parsers are useful for natural language processing. 
In \[4\] he presented a discussion about how to do non- 
deterministic LR-parsing, with a device called a graph- 
structured stack. With our parser we show that no ex- 
plicit stack manipulations are needed; they can be ex- 
pressed implicitly with the use of appropriate program- 
ming language concepts. 
Most textbooks on parsing do not include proper 
correctness proofs for LR-parsers, mainly because such 
proofs tend to be rather involved. The theory of LR- 
parsing should still be considered underdeveloped, for 
this reason. Our presentation, however, contains a sur- 
prisingly simple correctness proof. In fact, this proof is 
this paper's major contribution to parsing theory. One 
of its lessons is that the CF grammar class is often the 
natural one to proof parsers for, even if these parsers are 
devoted to some special class of grammars. If the gram- 
marlis restricted in some way, a parser for general CF 
grammars may have properties that enable smart imple- 
mentation tricks to enhance efficiency. As we show be- 
low, the relation between LR-parsers and LR-grammars 
is of this kind. 
Especially in natural language processing, standard 
CF grammars are often too limited in their strong gen- 
erative power. The extended CF grammar formalism, 
allowing rules to have regular expressions at the right 
hand side, is a useful extension, for that reason. It is not 
difficult to generalize our parser to cope with extended 
grammars, although the application of LR-parsing to 
extended CF grammars is well-known to be problematic 
\[5\]. 
We first present the recursive descent recognizer in 
a way that allows the desired generalization. Then we 
obtain the recursive ascent recognizer and its proof. If 
the grammar is LR(0) a few implementation tricks lead 
to the recursive ascent recognizer of ref. \[1\]. Subse- 
quently, the time and space complexities of the recog- 
nizer are analysed, and the algorithm for constructing 
a cubic representation for parse forests is given. The 
paper ends with a discussion of extended CF grammars. 
2 Recursive descent 
Consider CF grammar G, with terminals VT and non- 
terminals V/v. Let V = VN U VT. A well-known top- 
down parsing technique is the recursive descent parser. 
Recursive descent parsers consist of a number of pro- 
cedures, usually one for each non-terminal. Here we 
present a variant that consists of functions, one for each 
item (dotted rule). We use the unorthodox embracing 
operator \[.\] to map each item to its function: (we use 
greek letters for arbitrary elements of V*) 
\[A -~ a.~\] : N --. 2 N 
where N is the set of integers, or a subset (0...nm~x), 
with nma= the maximum seutence length. The functions 
are to meet the following specification: 
\[A --, a. l(0 = {Jl  -*" 
- 63 - 
with x~...xn the sentence to be parsed. A recursive im- 
plementation for these functions is given by (b • VT, B • v,,) 
\[A --* a.\](i) = {i} 
\[a --* a.b-r\](i) = {jib = zi+, ^ j E \[A ~ ab.-r\](i + 1)} 
\[A ---, a.B-r\](i) = 
{Jl~ • \[B ~ .~\](i)^j • \[A -; ~a.-r\](~)} 
We keep to the custom of omitting existential quantifi- 
cation (here for k,/f) in definitions of this kind. 
The proof is elementary and ba#ed on 
3~(3 = x~+a-r A -r ~* zi+~...~:s)V 
3B-~$k(3 = B-r ^ B --~ 8 A 8 --;* zi+a...x~^ 
-r --~* 2;k+l...2~j) 
If we add a grammar rule S' --* S to G, with S' (\[ V 
then S --** x~...xn is equivalent to n • \[S' --* .S\](0). 
The recursive descent recognizer works for any CF 
grammar except for grammars for which ~A~(A ---* 
aAcr --** A3). For such left-recursive grammars the rec- 
ognizer does not terminate, as execution of \[A --* .a\](i) 
will lead to a call of itself. The recognition is not a linear 
process in general: the function calls \[A --- a.B3"\](i) lead 
to calls \[B --* ./i\](i) for all values of ~ such that B ---, 
is a grammar rule. 
3 The ascent recognizer 
One way to make the recognizer more deterministic is by 
combining functions corresponding to a number of com- 
peting items into one function. Let the set of all items 
of G be given by In. Subsets of I6; are called states, and 
we use q to be an arbitrary state, lWe associate to each 
state q a function, re-using the above operator \[.\], 
\[q\] : N ~ 2 I°×N 
that meets the specification 
\[q\](i) ---- {(A -- a.3,j)l A --. a.3 • q ^ 3 --*" zi+~...xi} 
As above, the function reports which parts of the sen- 
tence can be derived. But as the function is associated 
to a set q of items, it has to do so for each item in 
q. If we define the initial state q0 = {S' --* .S}, now 
S --," xl...xn is equivalent to (S' ---* .S,n) • \[q0\](0). 
Before proceeding, we need a couple of definitions. 
Let ini(q) be the set of initial items for state q, that 
are derived from q by the closure operation: 
ini(q) = { B --* .AIB -. A ^ A --* a.3 • q A 3 =¢ B-r}. 
The double arrow =¢, denotes a left-most-symbol rewrit- 
ing Ba =e~ Cfla, using a non-e rule B ---, Cfl. The 
transition function goto is defined by (B • V) 
goto(q, B) = {A -* aB.3\]A --* a.B3 • (q U ini(q))} 
Also define 
pop(A ---, aB.3) = A --', a.B3 
lhs(A --* a.fl) = A 
final(A --. a.3) = (131 = 0) 
with B E V, and 1/31 the number of symbols in 3 (with 
H = 0). A recursive ascent recognizer may be obtained 
by relating to each state q not only the above \[q\], but 
also a function__ that we take to be the result of applying 
operator \[.\] to the state: 
\[q\] : V x N --* 2 I° xN 
It has the specification 
\[q\](B,i) = {(A --* a.3,j)lA --* a.3 e qA 
3 =~* B-rAT ---,* xi+a...xj} 
For i >.nn (n is the sentence length) it follows that 
\[q\](i) = \[q\](B,i) = $, whereas for i _< n the functions 
are recursively implemented by 
\[q\](i) = \[q\](x,+l, i + 1)u 
{(1, j)IB --* .e e ini(q)A (l, j) E-~(B, i)}U 
{(l,i)lI • q ^ final(l)} 
\[q\](B, i) = { (pop(l), J)l 
(1,j) • \[ooto(q, B)\](i)^ pop(l) • q}U 
{(I,4)1(J, k) • \[goto(q, B)I~^ 
pop(J) • ini(q) ^ (1, j) • \[q\](lhs(S), k)} 
Proof: 
First we notice that 
/8 "** xi+l..-xj 
3~(3 ~* zi+l't ^ 7 ~" z,+2...zj)v 
3B~(3 ~" B-r ^ B ~ c ^ -y --." z,+~...zj)v 
(~=~^i=j) 
Hence 
\[q\](i) = 
{(A --* a.3,J)l(A --* a.3, j) • r~(z,+~, i+ 1)}u 
{(A --, ,~.3, J)l 
B -.-. eA(A --, a.3,j) • \[q\](B,i)}u 
{(A --~ a.,i)la --* a. • q} 
This is equivalent to the earlier version because we may 
replace the clause B ~ e by B ---, .e • ini(q). Indeed, 
if state q has item A --* a.fl and if there is a left-most- 
symbol derivation/3 =~* B-r then all items B --* .A are 
included in ini(q). 
For establishing the correctness of \[q--\] notice that 
3 ~* B3" either contains zero steps, in which case 
3 = B'r, or it contains at least one step: 
3.y(3 =~* B3" A 3' --*" xi+a ...zs) = 
3~(3 = B-r ^ -r --" xi+l...zj)V 
3ce.~k(~8 :=~* C-rAG --* B~S A~5 -.*" xi+ l...x~,A -r -'** 
xk+l ...x j) 
Hence \[q\](B, i) may be written as the union of two sets, 
\[q\](B, i) = So USa: 
So = {(A --~ a.B3",j)\] 
A ---. ct.B3" • qA-r ---** xs+l...xj} 
S~ = {(a --. a.3,j)lA --* a.3 • q ^ 3 =~" C-r^ 
C ---* B~ ^ $ --** zi+l...xk ^ 3' --*" zk+l...zi}. 
By the definition of goto, if A ---, a.B-r • q then A --, 
aB.-r • goto(q, B). tlence, with the specification of \[q\], 
So may be rewritten as 
So = {(A --. a.B-r,j)IA --. a.B-r • q^ 
(A ---* aB.3",j) • \[goto(q, B)\](i)} 
- 64 - 
The set $1 may be rewritten using the specification of 
\[q\](C, k): 
S1 : {(A -'~ a.~,j)l(A -~ a.~,j) E \[q\](C,k)A 
C --* B6 A 6 --," xi+,...xk}. 
Also, as before, ~ =~* C'r implies that all items C ~ .g 
are in ini(q), and the existence of C -* .B~ in ini(q) 
implies C ~ B.~ E goto(q, B): 
Sx = {(A ~ a.~,j)l(A ~ ~.B,j) E \[q\](C, k)A 
C --~ .B~ E ini(q)A 
(C --. B.6, k) ~ \[goto(q, B)\](i)}. 
n 
In the computation of \[q0\](0), functions are needed 
only for states in the canonical collection of LR(0) states 
\[6\] for G, i.e. for every state that can be reached from the 
initial state by repeated application of the goto function. 
Note that in general the state ¢ will be among these, and 
that both \[¢\](i) and \[g\](B, i) are empty sets for all i _> 0 
and B E V. 
4 Deterministic variants 
One can prove that, if the grammar is LR(0), each rec- 
ognizer function for a canonical LR(0) state results in 
a set with at most one element. The functions for non- 
empty q may in this case be rephrased as 
\[q\](i): 
if, for some I, I E q A final(l) t__hen return {(I, i)} else 
if B --..e E ini(q) then ret__.urn \[q\](B, i) 
else if i < n then return \[q\](xi+~, i + 1) 
else return 
fi 
\[q\](B,i): 
if \[9oto(q, B)\](i) = ¢ then return ~ else 
let (I, j) be the unique element of \[goto(q, B)\](i). Then: 
if pop(I) E q then return {(pop(l), j)} 
else return \[q\](Ihs(l), j) 
fl 
fi 
Reversely, the implementations of \[q\](i) and \[q\](B,i) of 
the previous section can be seen as non-deterministic 
versions of the present formulation, which therefore pro- 
vides an intuitive picture that may be helpful to under- 
stand the non-deterministic parsing process in an oper- 
ational way. 
Each function can be replaced by a procedure that, 
instead of returning a function result, assigns the result 
to a global (set) variable. As this set variable may con- 
tain at most one element, it can be represented by three 
variables, a boolean b, an item R and an integer i. If 
a function would have resulted in the set {(I,j)}, the 
global variables are set to b = TRUE, R = I and i = j. 
A function value ~ is represented by b = FALSE. Also 
the arguments of the functions are superfluous now. The 
rble of argument i can be played by the global variable 
with the same na__.rne, and lhs(R)can be used instead of 
argument B of \[q\]. Consequently, procedure \[¢\] becomes 
a statement b := FALSE, whereas for non-emp.~, q one 
gets the procedures (keeping the names \[q\] and \[q\], trust- 
ing no confusion will arise): 
\[q\] : 
if, for some I, I E q A final(l) then R := I 
else if B --..¢ E ini(q) then R := B -- e.; \[q\] __ 
else if i < n then R := xi+a -- xi+l.; i := i + 1; \[q\] 
else b := FALSE 
fi 
N M: 
\[goto(q, Ihs(R))l; 
if b. then 
if pop(R) E q then R := pop(R) 
.else \[q\] 
fi 
fi 
Note that these procedures do not depend on the details 
of the right hand side of R. Only the number of sym- 
bols before the dot is relevant for the test "pop(R) E q". 
Therefore, R can be replaced by two variables X E V 
and an integer I, making the following substitutions in 
the previous procedures: 
R:=A--*a. =~ X:=A;I:=Icrl 
R:=pop(R) =~ l := l-1 
pop(R) E q =~ l # l v X = S' 
lhs( R) =~ X 
After these substitutions, one gets close to the recursive 
ascent recognizer as it was presented in \[1\]. A recognizer 
that is virtually the same as in \[l~s obtained by replac- 
ing the tail-recursive procedure \[q\] by an iterative loop. 
Then one is left with one procedure for each state. While 
parsing there is, at each instance, a stack of activated 
procedures that corresponds to the stacks that are ex- 
plicitly maintained in conventional implementations of 
deterministic LR-parsers. 
5 Complexity 
For LL(0) grammars the recursive descent recognizer is 
deterministic and works in linear time. The same is 
true of the ascent recognizer for LR(0) grammars. In 
the general, non-deterministic, case the recursive de- 
scent and ascent recognizers need exponential time un- 
less the functions are implemented as memo-functions 
\[3\]. Memo-functions memorize for which arguments they 
have been called. If a function is called with the same 
arguments as before, the function returns the previous 
result without recomputing it. In conventional program- 
ming languages memo-functions are not available, but 
they can easily be implemented. Devices like graph- 
structured stacks \[4\], parse matrices \[7\], or welbformed 
- 65 - 
substring tables \[8\], are in fact low-level realizations of 
the abstract notion of memo-functions. The complex- 
ity analysis of the recognizers is quite simple. There are 
O(n) different invocations of parser functions. The func- 
tions call at most O(n) other functions, that all result 
in a set with O(n) elements (note that there exist only 
O(n) pairs (I, j) with I E IG, i _< j _< n). Merging these 
sets to one set with no duplicates can be accomplished in 
O(n 2) time on a random access machine. Hence, the to- 
tal time-complexity is O(na). The space needed for stor- 
ing function results is O(n) per invocation, i.e. O(n 2) 
for the whole recognizer. 
The above considerations only hold if the parser ter- 
minates. The recursive descent parser terminates for all 
grammars that are not left-recursive. For the recursive 
ascent parser, the situation is more complicated. If the 
gra_m.mmar has a cyclic derivation B -** B, the execution 
of \[q\](B, i) leads to a call of itself. Also, there may be a 
cycle of transitions labeled by non-terminals that derive 
e, e.g. if goto(q, B) = q A B ---, e, so that the execution 
of \[q\](i) leads to a call of itself. There are non-cyclic 
grammars that suffer from such a cycle (e.g. S --* SSb, 
S --* e). Hence, the ascent parser does not terminate if 
the grammar is cyclic or if it leads to a cycle of transi- 
tions labeled b_.~ non-terminals that derive e. Otherwise, 
execution of \[q\](B, i) can only lead to calls of \[p\](i) with 
p ~ q and to calls of \[q\](C,k), such that either k > i 
or C--** BAC ~ B. As there are only finitely many 
such p, C, the parser terminates. Note that both the re- 
cursive descent and ascent recognizer terminate for any 
grammar, if the recognizer functions are implemented 
as memo-functions with the property that a call of a 
function with some arguments yields $ while it is under 
execution. For instance, if execution of \[q\](i) leads to 
a call of itself, the second call is to yield ~. A remark 
of this kind, for the recursive descent parser, was first 
made in ref. \[8\]. The recursive descent parser then be- 
comes virtually equivalent to a version of the standard 
Earley algorithm \[9\] that stores items A ---* a./~ in parse 
matrix entry Ti i if/~ ---,* xi+l...xi, instead of storing it 
if a --*° x~+l...xj. 
The space required for a parser that also calculates 
a parse forest, is dominated by this forest. We show 
in the next section that it may be compressed into a 
cubic amount of space. In the complexity domain our 
ascent parser beats its rival, Tomita's parsing method 
\[4\], which is non-polynomial: for each integer k there 
exists a grammar such that the complexity of the Tomita 
parser is worse than n k. 
In addition to the complexity as a function of sen- 
tence length, one may also consider the complexity as 
a function of grammar size. It is clear that both time 
and space complexity are proportional to the number of 
parsing procedures. The number of procedures of the 
recursive descent parser is proportional to the number 
of items, and hence a linear function of the grammar 
size. The recursive ascent parser, however, contains two 
functions for each LR-state and is hence proportional to 
the size of the canonical collection of LR(0) states. In 
the worst case, this size is an exponential function of 
grammar size, but in the average natural language case 
there seems to be a linear, or even sublinear, dependence 
\[4\]. 
6 Parse forest 
Usually, the recognition process is followed by the con- 
struction of parse trees. For ambiguous grammars, it 
becomes an issue how to represent the set of parse trees 
as compactly as possible. Below, we describe how to 
obtain a cubic representation in cubic time. We do so 
in three steps. 
In the first step, we observe that ambiguity often 
arises locally: given a certain context C\[-\], there might 
be several parse subtrees tl...tk (all deriving the same 
substring xi+l...xj from the same symbol A) that fit 
in that same context, leading to the parse trees C\[tl\], 
eft2\] ..... c\[th\] for the given string zl...zn. Instead of 
representing these parse trees separately, repeating each 
time the context C, we can represent them collectively 
as C\[{~1, ..., tk}\]. Of course, this idea should be applied 
recursively. Technically, this leads to a kind of tree-llke 
structure in which each child is a set of substructures 
rather than a single one. 
The sharing of context can be carried one step further. 
If we have, in one and the same context, a number of 
applied occurrences of a production rule A ---, a/~ which 
share also the same parse forest for a, we can represent 
the context of A ---* a~ itself and the common parse 
forest for a only once and fit the set of parse forests for 
fl into that. Again this idea has to be applied recursively. 
Technically, this leads to a binary representation of parse 
trees, with each node having at most two sons, and to 
the application of the context sharing technique to this 
binary representation. 
These two ideas are captured by introducing a func- 
tion f with the interpretation that f(f3, i,j) represents 
the parse forest of all derivations from /~ E V* to 
zi+~...x~, for all i,j such that 0 < i < j < n. The 
following recursive definitions fix the parse forest repre- 
sentation formally: 
f(~, i,j) ={\[l\[i = J}, 
f(a, i, j) = {alj = i + 1 ^ x,+l = a}, for all a e liT, 
f(A,i,j) = {(A,f(ot, i,j))lA ~ aA 
a .---*" xi+l...x~}, for all A E VN, 
f(AB/3, i, j) = {(f(A, i, k), f(B#, k, J))l 
i < k < jAA ---," xi+l...Xk ^ B/~ --~" xk+l...xj}, for 
all A, B E V. 
The representation for the set of parse trees is then just 
f(S, 0, n). 
We now come to our third step. Suppose, for the mo- 
ment, that the guards a ---,* xi+l...xj and the like, oc- 
curring above, can be evaluated in some way or another. 
Then we can use function f to compute the representa- 
tion of the set of parse trees for sentence xl...xn. If we 
make use of memo-functions to avoid repeated compu- 
tation of a function applied to the same arguments, we 
see that there are at most O(n 2) function evaluations. 
- 66 - 
If we represent function values by re\]erences to the set 
representations rather than by the sets themselves, the 
most complicated function evaluation consumes an ad- 
ditional amount of storage that is O(n): for j - i + 1 
values of k we have to perform the construction of a 
pair of (copies of) two references, costing a unit amount 
of storage each. Therefore, the total amount of space 
needed for the representation of all parse trees is O(n3). 
The evaluation of the guards ct ---." xi+l...xj etc. 
amounts exactly to solving a collection of recognition 
problems. Note that a top-down parser is possible 
that merges the recognition and tree-building phases, 
by writing 
f(A,i,j) = {(A,f(ot, i,j))lA -., a A f(a,i,j) # ~}, for 
all A E VN, 
I(AB/i, i, j) = {(f(A, i, k),/(B/i, k, J))l 
i < k < j A f(A,i,k) # ¢ A f(B/i,k,j) # ~}, 
for all A, B E V, 
the other cases for f being left unchanged. Note the sim- 
ilarity between the recognizing part of this algorithm 
and the descent recognizer of section 2. Again, this 
parser is a cubic algorithm if we use memo-functions. 
Another approach is to apply a bottom-up recognizer 
first and derive from it a set P containing triples (/i, i,j) 
only if/3 ---'" xi+l...xj, and at least those triples (/i, i,j) 
for which the guards/3 ---** xi+a ...xj are evaluated dur- 
ing the computation of f(S, O, n) (i.e., for each deriva- 
tion S ---." xl...xkAxj+l...Zn "-* Xl...XkOl/iXj+l...Xn "-'** 
zl...xiflzj+l...xn "~" xl...xn, the triples (/i,i,j) and 
(A,k,j) should be in P). The simplest way to obtain 
such P from our recognizer is to assume an implementa- 
tion of memo-functions that enables access to the mem- 
oized function results, after executing \[q0\](O). Then one 
has the disposal of the set 
{(/i, i,j)l\[q\](i ) was invocated and 
(A --* a./i, j) e \[q\](i)} 
Clearly, (/i,i,j) is only in this set if /i --+" xi+l...x i. 
Note, however, that no pairs (A --~ ./i,j) are included 
in \[q\](i) (except if A = S'). We remedy th__is with a 
slight change of the specifications of \[q\] and \[q\], defining 
~ q U ini(q): 
\[q\](i) = 
{(A --.* a.3, j)lA --~ c~./~ E ~A/i ---** xi+l...xj} 
\[q\](B,i) = {(a ---* a./i,j)lA ---* a./i E "~A 
t3 ~* BT A 7 ""* Xi+l"'Xj} 
A recursive implementation of the recognition functions 
now is 
\[q\](i) = {(I,Y)l(I,j) e \[q\](~+~, i + l\[}.p 
{(l,j)l B --.., e ini(q) A (I,j) E \[q\](B,i)}U 
{(I, i)lI E ~ A final(l)} 
\[q\](B, i) = {(pop(I), J)l(l, J) E \[goto(q, B)\](i)lu 
{(I, j)l(J, k} e \[goto(q, B)I~}A 
pop(J) E ini(q) A (I,j) e \[q\](lhs(J),k)} 
If we define, for this revised recognizer, 
P = {(3, i, j)l\[q\](i) was invocated and 
(A -...~, j) e \[q\](i)}u 
{(A, i, j)l\[q\](i) was invocated and 
(a --, .~,j) e \[q\](i)}u 
{(x~+~,i,i+ DI0 < i < n}, 
it contains all triples that are needed in f(S, O, n), and 
we may write the forest constructing function as 
f(A,i,j) = {(a,f(a,i,j))lA --, a^ (a,i,j) E P}, for 
all A E V~, 
f(AB/i, i, j) ---- {(I(A, i, k), f(B/3, k, J))l 
(A, i, k) e P A (Bit, k, j) e P}, for all A, B e V, 
the other cases for f being left unchanged again. There 
exists a representation of P in quadratic space such that 
the presence or absence of an arbitrary triple can be de- 
cided upon in unit time. As a result, the time complexity 
of f(S, O, n) is cubic. 
7 Extended CF grammars 
An extended CF grammar consists of grammar rules 
with regular expressions at the right hand side. Every 
extended CF grammar can be translated into a normal 
CF grammar by replacing each right hand side by a 
regular (sub)grammar. The strong generative power is 
different from CF grammars, however, as the degree of 
the nodes in a derivation tree is unbounded. To apply 
our recognizer directly to extended grammars, a few of 
the foregoing definitiovs have to be revised. 
As before, a grammar rule is written A --, a, but with 
a now a regular expression with Na symbols (elements 
of V). Defining T + = 1...N,, and Ta = 0...Na, regular 
expression tr can be characterized by 
1. a mapping ¢~ : T~ + ~ V associating a grammar 
symbol to each number. 
2.. a function succo : To --* 2 T+ mapping each num- 
ber to its set of successors. The regular expression 
can start with tile symbols corresponding to the 
numbers in succo(O). 
3. a set a,~ E 2 7`0 of numbers of symbols the regular 
expression can end with. 
Note that 0 is not associated to a symbol in V and is not 
a possible element of succ,,(k). It can be element of a,~ 
though, in which case there is an empty path through 
the regular expression. 
We define an item as a pair (A --, a,k), with the 
interpretation that number k is 'just before the dot'. 
The correspondence with dotted rules is the following. 
Let a = B1...Bt, then a is a simple regular expression 
characterized by ~ba(k) = Bk, succa(k) = {k + 1} if 
0 < k < l, succo(l) = {~, and a,, = {I}. Item (A ---. a,0) 
corresponds to the initial item A ---* .a and (A ---* a, k) 
to the dotted-rule item with the dot just after Bk. 
The predicate final for the new kind of items is defined 
by 
final((A ---* a, k)) = (k E an) 
Given a set q of items, we define 
- 67 - 
ini(q) = {(A -- a,0)l(B ---* fl, l) • qA 
k • s.cc,(0 ^ ¢a(k) ~" A~} 
The function pop becomes set-valued and the transition 
function can be defined in terms of it (remember: ~ = 
q U ini(q)): 
pop((A ~ a, l)) = {(a --. a, k)ll • succ.(k)} 
goto(q, B) = {(a ---, a, k)l*.(k) = B a I • ~A 
I • pop((a --* a, k))} 
A recursive ascent recognizer is now implemented by 
\[q\](i) = \[q\](~ci+l, i + 1)U 
{(I, j)lJe ini(q) ^ final(J)A 
(I,j) • \[q\](lhs(J), i)}U 
{(I, i)ll • q ^ final(\[)) 
\[q\](B,i) = {J,j)lJ • q ^ J • pop(I)^ 
(1, j) • \[goto(q, B)\](i)}U 
t(I,j)l(J, k) • \[goto(q,B)\](i) A K • ini(q)^ 
K • pop(J)^ (l,j) • \[q\](lhs(J), k)} 
The initial state q0 is {(S' ---* S, 0)}, and a sentence 
xl...x, is grammatical if ((S' --* S, 0), n) • \[qo\](O). The 
recognizer is deterministic if 
1. there is no shift-reduce or reduce-reduce conflict, 
i.e. every state has at most one final item, and in 
case it has a final item it has no items (A --, ~,j) 
with k e succ,~(j) A ~b,~(k) • VT. 
2. for all reachable states q, q N ini(q) = ~, and for all 
I there is at most one J • ~ such that J E pop(I). 
In the deterministic case, the analysis of section 4 can be 
repeated with one exception: extended grammar items 
can not be represented by a non-terminal and an integer 
that equals the number of symbols before thc dot, as this 
notion is irrelevant in the case of regular expressions. In 
standard presentations of deterministic LR-parsing this 
leads to almost unsurmountable problems \[5\]. 
8 Conclusions 
We established a very simple and elegant implementa- 
tion of LR(0) parsing. It is easily extended to LALR(k) 
parsing by letting the functions \[q\] produce pairs with 
final items only after inspection of the next k input sym- 
bols. 
The functional LR-parser provides a high-level view of 
LR-parsing, compared to conventional implementations. 
A case in point is the ubiquitous stack, that simply cor- 
responds to the procedure stack in the functional case. 
As the proof of a functional LR-parser is not hindered 
by unnecessary implementation details, it can be very 
compact. Nevertheless, the functional implementation 
is as efficient as conventional ones. Also, the notion of 
memo-functions is an important primitive for present- 
ing algorithms at a level of abstraction that can not 
be achieved without them, as is exemplified by this pa- 
per's presentation of both the recognizers and the parse 
forests. 
For non-LR grammars, there is no reason to use 
the complicated Tomita algorithm. If indeed non- 
deterministic LR-parsers beat the Earley algorithm for 
some natural language grammars, as claimed in \[4\], this 
is because the number of LR(0) states may be smaller 
than the size of IG for such grammars. Evidently, for the 
grammars examined in \[4\] this advantage compensates 
the loss of efficiency caused by the non-polynomiality 
of Tomita's algorithm. The present algorithm seems to 
have the possible advantage of Tomita's parser, while 
being polynomial. 
Acknowledgement 
A considerable part of this research was done in collabo- 
ration with Lex Augusteyn and Frans Kruseman Aretz. 
Both are colleagues at Philips Research. 
References 
1 F.E.J. Kruseman Aretz, On a recursive ascent parser, 
In\]ormation Processing Letters (1988) 29:201-206. 
2 G.H. Roberts, Recursive Ascent: An LR Analog 
to Recursive Descent, SIGPLAN Notices (1988) 
23(8):23-29. 
3 J. Hughes, Lazy Memo-Functions in Functional Pro- 
gramming Languages and Computer Architecture 
edited by J.-P. Jouannaud, Springer Lecture Notes 
in Computer Science (1985) 201. 
4 M. Tomita, Efficient Parsing \]or Natural Language 
(Kluwer Academic Publishers, 1986). 
5 P.W. Purdorn and C.A. Brown, Parsing extended 
LR(k) grammars, Acta lnformatica (1981) 15:115- 
127. 
6 A.V. Aho and J D. Ullman, Principles of Compiler 
Design (Addison-Wesley publishing company,1977) 
7 A,V. Aho and J.D. Ulhnan, The theory o\] parsing, 
translation, and compiling (Prentice Hall Inc. En- 
glewood Cliffs N.J.,1972). 
8 B.A. Shell. Observations on Context Free Parsing in 
Statistical Methods in Linguistics (Stockhohn (Swe- 
den) 1976). 
Also: Technical Report TR 12-76, Center for Re- 
search in Computing Technology, Aiken Computa- 
tion Laboratory, Harvard Univ., Cambridge (Mas- 
sachusetts). 
9 J. Earley, 1970. An Efficient Context-Free Parsing 
Algorithm, Communications ACM 13(2):94-102. 
- 68 - 
