An Efficient Parser Generator for Natural Language
Masayuki ISHII* Kazuhisa OHTA Hiroaki SAITO
Fujitsu Inc. Apple Technology, Inc. Keio University 
masayuki@nak.math.keio.ac.jp k-ohta@kobo.apple.com hxs@nak.math.keio.ac.jp
Abstract 
We have developed a parser generator for natural language
processing. The generator, named "NLyacc", accepts grammar rules
written in the Yacc format. NLyacc, unlike Yacc, can handle
arbitrary context-free grammars using the generalized LR parsing
algorithm. The parser produced by NLyacc efficiently parses given
sentences and executes semantic actions. NLyacc, which is free and
sharable software, runs on UNIX workstations and personal computers.
1 Parser Generator for NLP 
Yacc [4] was designed for unambiguous programming languages. Thus,
Yacc cannot elegantly handle a script language with a natural
language flavor, i.e. Yacc forces a grammar writer to use tricks for
handling ambiguities. To remedy this situation we have developed
NLyacc, which can handle arbitrary context-free grammars¹ and allows
a grammar writer to write natural rules and semantic actions.
Although there are several parsing algorithms for a general
context-free language, such as ATN, CYK, and Earley, "the
generalized LR parsing algorithm [2]" would be the best in terms of
its compatibility with Yacc and its efficiency.
* This work was done while Ishii stayed at Dept. of Computer
Science, Keio University, Japan.
¹ To be exact, NLyacc cannot handle a circular rule like "A → A".

An ambiguous grammar causes a conflict in the parsing table, a state
which has more than one action in an entry. The generalized LR
parsing proceeds exactly the same way as the standard one except
when it encounters a conflict. The standard deterministic LR parser
chooses only one action in this situation. The generalized LR
parser, on the other hand, performs all the actions in the multiple
entry by splitting the parse stack for each action. The parser
merges the divided stack branches only when they have the same top
state. This merger operation is important for efficiency. As a
result, the stack becomes a graph instead of a simple linear state
sequence.
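
The merge step can be pictured with a small C sketch. It is not taken from NLyacc's source: the names GSSNode, Frontier and gss_push and the fixed array sizes are assumptions made for illustration, and a full implementation would also have to handle reductions across merged vertices.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PREDS 8    /* hypothetical limits, chosen for the sketch */
#define MAX_TOPS  16

/* One vertex of the graph-shaped parse stack. */
typedef struct gss_node {
    int state;                          /* LR state number */
    int npreds;                         /* number of predecessor links */
    struct gss_node *preds[MAX_PREDS];  /* vertices directly below this one */
} GSSNode;

/* The set of current stack tops, one per live parse branch. */
typedef struct {
    int ntops;
    GSSNode *tops[MAX_TOPS];
} Frontier;

/* Push `state` on top of `below`. If a top with the same state already
   exists in `next`, the branches are merged: the existing vertex simply
   gains one more predecessor. Otherwise a new top is created. */
GSSNode *gss_push(Frontier *next, GSSNode *below, int state)
{
    int i;
    GSSNode *n;

    for (i = 0; i < next->ntops; i++) {
        if (next->tops[i]->state == state) {   /* same top state: merge */
            n = next->tops[i];
            assert(n->npreds < MAX_PREDS);
            n->preds[n->npreds++] = below;
            return n;
        }
    }
    assert(next->ntops < MAX_TOPS);
    n = (GSSNode *)malloc(sizeof *n);
    assert(n != NULL);
    n->state = state;
    n->npreds = 1;
    n->preds[0] = below;
    next->tops[next->ntops++] = n;
    return n;
}
```

Pushing the same state from two different branches returns the same vertex with two predecessor links, which is what turns the stack into a graph.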
There is already a generalized LR parser for natural language
processing developed at Carnegie Mellon University [3]. NLyacc
differs from CMU's system in the following points.
• NLyacc is written in C, while CMU's is written in Lisp.
• CMU's cannot handle ε rules, while NLyacc can. ε rules are handy
  for writing natural rules.
• The way to execute semantic actions differs. CMU's evaluates an
  LFG-like semantic action attached to each rule when a reduce
  action is performed on that rule. NLyacc executes a semantic
  action in two levels; one is performed during parsing for
  syntactic control and the other is performed on each successful
  final parse. We will describe the details of NLyacc's approach in
  the next section.
NLyacc is upward compatible with Yacc. NLyacc consists of three
modules: a reader, a parsing table constructor, and a driver routine
for the generalized LR parsing. The reader accepts grammar rules in
the Yacc format. The table constructor produces a generalized LR
parsing table instead of the standard LR parsing table. We describe
the details of the parser in the next section.
2 Execution of Semantic Actions
NLyacc differs from Yacc mainly in the execution process of semantic
actions attached to each grammar rule. Namely, Yacc evaluates a
semantic action as it parses the input. We examine here whether this
evaluation mechanism is suitable for the generalized LR parsing. If
we can assume that there is only one final parse, the parser can
evaluate semantic actions when only one branch exists on top of the
stack. Although having only one final parse is often the case in
practical applications, the constraint of being unambiguous is too
strong in general.
2.1 Handling Side Effects 
Next, we examine what would happen if semantic actions are executed
during parsing. When a reduce action is performed, the parser
evaluates the action attached to the current rule. As described in
the previous section, the parse stack grows in a graph form. Thus,
when the action contains side effects like an assignment operation
to a variable shared by different actions, that side effect must not
propagate to the other paths in the graph.
If an environment, which is a set of values of variables, is
prepared for each path of the parse branches, such side effects can
be encapsulated. When a stack splits, a copy of the environment
should be created for each branch. When the parse branches are
merged, however, the environments cannot be merged. Instead, the
merged state must have all the environments. Thus, the number of
environments grows exponentially as parsing proceeds. Therefore this
approach decreases the parsing efficiency drastically. Also, this
high-cost operation would be in vain when the parse fails in the
middle. To sum up, although this approach retains compatibility with
Yacc, it sacrifices efficiency too much.
2.2 Two Kinds of Semantic Actions 
We, therefore, take another approach to handling semantic actions in
NLyacc. Namely, the parser just keeps a list of actions to be
executed, and performs all the actions after parsing is done. This
method can avoid the problem above during parsing. After parsing is
done, the semantic action evaluator performs the task as it traces
all the history paths one by one. This approach retains parsing
efficiency and can avoid the execution of useless semantic actions.
A drawback of this approach is that semantic actions cannot control
the syntactic parsing, because actions are not evaluated until the
parsing is done. To compensate for this drawback, we have introduced
a new semantic action enclosed with [ ] to enable a user to discard
semantically incorrect parses in the middle of parsing.
Namely, there are two types of semantic actions:
• An action enclosed with [ ] is executed during parsing just as
  done in Yacc. If 'return 0;' is executed in the action, the
  partial parse having invoked this action fails and is discarded.
• An action enclosed with { } is executed after the syntactic
  parsing.
In the example below, the bracketed action checks if the subtraction
result is negative, and, if true, discards its partial parse.
number : number '-' number
    [ $$ = $1-$3; if($$ < 0) return 0; ]
    { $$ = $1-$3; print( ..... , $1, $3, $$); }
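
To make the division of labour concrete, the two actions of this rule can be imitated with two plain C functions. This is only a sketch: minus_check and minus_final are hypothetical names, the output format is invented, and treating any nonzero return as "keep the branch" is an assumption based on how check1 and check2 are used in the appendix.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the "[ ]" action of  number : number '-' number.
   It runs during parsing; returning 0 asks the parser to drop the
   partial parse, a nonzero value keeps it (assumed convention). */
int minus_check(int lhs, int rhs, int *result)
{
    assert(result != NULL);
    *result = lhs - rhs;          /* $$ = $1 - $3; */
    if (*result < 0)
        return 0;                 /* discard this partial parse */
    return 1;                     /* keep it */
}

/* Stand-in for the "{ }" action. It runs only after parsing, and only
   for parses that survived every bracketed check. */
int minus_final(int lhs, int rhs)
{
    int result = lhs - rhs;       /* $$ = $1 - $3; */
    printf("%d - %d = %d\n", lhs, rhs, result);
    return result;
}
```

With these stand-ins, an input such as "3 - 5" is rejected by minus_check before minus_final is ever reached, so no output is produced for it.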
2.3 Keeping Parse History 
Our generalized LR parsing algorithm is different from the original
one [2] in that our algorithm keeps a history of parse actions to
execute semantic actions after the syntactic parsing. The original
algorithm uses a packed forest representation for the stack, whereas
our algorithm uses a list representation.
The algorithm for keeping the parse history is as follows.
1) If the next action is "shift s", then make <s> as the history,
where <s> is a list of only one element s.
2) If the next action is "reduce r : A → B1 B2 ... Bn", then make
append(H1, H2, ..., Hn, [-r]) as the history, where Hi is a history
of Bi, r is the rule number of production "A → B1 B2 ... Bn", and
the function 'append' concatenates multiple lists and returns the
result.
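
The two rules can be sketched in C as follows. The History type and the function names are assumptions made for this illustration, as is the requirement that rule numbers start at 1 so that a reduce entry -r is always negative; memory is never freed here to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A parse history: a flat list of integers. */
typedef struct {
    int len;
    int *items;
} History;

/* Rule 1: "shift s" yields the one-element list <s>. */
History history_shift(int s)
{
    History h;
    h.len = 1;
    h.items = (int *)malloc(sizeof(int));
    assert(h.items != NULL);
    h.items[0] = s;
    return h;
}

/* Rule 2: "reduce r" yields append(H1, ..., Hn, [-r]), where
   parts[0..n-1] are the histories of the right-hand side symbols. */
History history_reduce(const History *parts, int n, int r)
{
    History h;
    int i, pos = 0, total = 1;      /* one extra slot for -r */

    assert(r > 0);                  /* assumed: rules are numbered from 1 */
    for (i = 0; i < n; i++)
        total += parts[i].len;
    h.len = total;
    h.items = (int *)malloc(total * sizeof(int));
    assert(h.items != NULL);
    for (i = 0; i < n; i++) {
        memcpy(h.items + pos, parts[i].items, parts[i].len * sizeof(int));
        pos += parts[i].len;
    }
    h.items[pos] = -r;              /* the reduce itself comes last */
    return h;
}
```

An ε rule corresponds to n = 0, in which case the history is just the single entry -r.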
Now we describe how to execute semantic actions using the parse
history. First, before starting to parse, the parser calls the
"yyinit" function to initialize variables in the semantic actions.
Our system requires the user to define "yyinit" to set initial
values to the variables. Next, the parser starts parsing and
performs a shift action or a reduce action according to the parse
history and evaluates the appropriate semantic actions.
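
A replay of this kind might look like the C sketch below. Everything specific in it is an assumption for illustration: the two-rule toy grammar, the convention that positive history entries are shifts and negative entries are reduces by rule -r, and the idea that a shift pushes the value of the next input token. The call to "yyinit" that would precede the replay is omitted.

```c
#include <assert.h>

#define STACK_MAX 64

/* Hypothetical toy grammar used only for this sketch:
     rule 1: number : NUM
     rule 2: number : number '-' number
   Replays one history and returns the value left on the stack. */
int replay(const int *history, int hlen, const int *token_values)
{
    int stack[STACK_MAX];
    int sp = 0;      /* value stack pointer */
    int next = 0;    /* index of the next input token */
    int i;

    for (i = 0; i < hlen; i++) {
        if (history[i] > 0) {               /* shift: push token value */
            assert(sp < STACK_MAX);
            stack[sp++] = token_values[next++];
        } else if (history[i] == -1) {      /* rule 1: $$ = $1 */
            assert(sp >= 1);
        } else if (history[i] == -2) {      /* rule 2: $$ = $1 - $3 */
            assert(sp >= 3);
            stack[sp - 3] = stack[sp - 3] - stack[sp - 1];
            sp -= 2;                        /* pop three, push one */
        } else {
            assert(0 && "unknown history entry");
        }
    }
    assert(sp == 1);
    return stack[0];
}
```

For the input "7 - 4", one possible history is shift, reduce 1, shift, shift, reduce 1, reduce 2; replaying it performs the subtraction exactly once, after parsing has finished.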
2.4 Efficient Memory Management 
We use a list structure to implement the parse stack, because the
stack becomes a complex graph structure as described previously.
Because the parser discards failed branches of the stack, the system
reclaims the memory allocated for the discarded parses using the
"mark and sweep garbage collection algorithm [1]" to use memory
efficiently. This garbage collection is triggered only when the
memory is exhausted in our current implementation.
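
The following C sketch shows the general shape of such a collector on a tiny fixed heap. It illustrates the mark and sweep technique under assumptions of our own (the Cell layout, the heap size, and the function names) and is not NLyacc's actual allocator.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CELLS 8   /* tiny fixed heap, for illustration only */

typedef struct cell {
    struct cell *link;   /* e.g. pointer to the stack vertex below */
    int in_use;          /* cell is currently allocated */
    int marked;          /* set during the mark phase */
} Cell;

Cell heap[HEAP_CELLS];

/* Mark phase: follow links from one root and flag every reachable cell. */
void mark(Cell *c)
{
    while (c != NULL && !c->marked) {
        c->marked = 1;
        c = c->link;
    }
}

/* Sweep phase: free every allocated cell left unmarked and clear all
   marks. Returns the number of cells reclaimed. */
int sweep(void)
{
    int i, freed = 0;
    for (i = 0; i < HEAP_CELLS; i++) {
        if (heap[i].in_use && !heap[i].marked) {
            heap[i].in_use = 0;
            heap[i].link = NULL;
            freed++;
        }
        heap[i].marked = 0;
    }
    return freed;
}

/* Collect garbage, treating the live stack tops as roots. */
int collect(Cell **roots, int nroots)
{
    int i;
    for (i = 0; i < nroots; i++)
        mark(roots[i]);
    return sweep();
}

/* Allocate a cell; collection runs only when the heap is exhausted. */
Cell *alloc_cell(Cell *link, Cell **roots, int nroots)
{
    int pass, i;
    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < HEAP_CELLS; i++) {
            if (!heap[i].in_use) {
                heap[i].in_use = 1;
                heap[i].link = link;
                return &heap[i];
            }
        }
        if (pass == 0) {
            mark(link);               /* the pending link is live too */
            collect(roots, nroots);   /* memory exhausted: run GC once */
        }
    }
    return NULL;   /* still full after collection */
}
```

Cells belonging to a discarded branch are unreachable from the remaining stack tops, so they stay unmarked and are reclaimed by the next sweep.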
3 Distribution 
Portability 
Currently, NLyacc runs on UNIX workstations and DOS personal
computers.
Debugging Grammars 
For grammar debugging, NLyacc provides parse trace information such
as a history of shift/reduce actions and execution information of
"[ ]" actions.
When NLyacc encounters an error state, the "yyerror" function is
called just as in Yacc.
Distribution 
NLyacc is distributed through e-mail (please contact
nlyacc@nak.math.keio.ac.jp). The distribution package includes all
the source code, a manual, and some sample grammars.
References 
[1] J. McCarthy. Recursive functions of symbolic expressions and
their computation by machine, part I. Communications of the ACM,
3(4), April 1960.
[2] M. Tomita. Efficient Parsing for Natural Language. Kluwer
Academic Publishers, Boston, MA, 1985.
[3] M. Tomita and J. G. Carbonell. The universal parser architecture
for knowledge-based machine translation. In Proceedings, 10th
International Joint Conference on Artificial Intelligence (IJCAI),
Milan, August 1987.
[4] yacc - yet another compiler-compiler: parsing program generator.
In UNIX manual.
Appendix - Sample Runs - 
The sample grammar below covers a small set of English sentences.
The parser produces syntactic trees of a given sentence. Agreement
checking is done by the semantic actions.
/* grammar.y */
%{
#include <stdio.h>
#include <stdlib.h>
#include "grammar.h"
#include "proto.h"
%}
%token NOUN VERB DET PREP
%%
SS : S          { pr_tree($1); }
S  : NP VP      [ return check1($1, $2); ]
                { $$ = mk_tree2("S", $1, $2); }
S  : S PP       { $$ = mk_tree2("S", $1, $2); }
NP : NOUN       [ $$ = $1; ]
                { $$ = mk_tree1("NP", $1); }
NP : DET NOUN   [ $$ = $2; return check2($1, $2); ]
                { $$ = mk_tree2("NP", $1, $2); }
NP : NP PP      [ $$ = $1; ]
                { $$ = mk_tree2("NP", $1, $2); }
PP : PREP NP    { $$ = mk_tree2("PP", $1, $2); }
VP : VERB NP    [ $$ = $1; ]
                { $$ = mk_tree2("VP", $1, $2); }
%%
FILE* yyin;
extern int yydebug;
int main(argc, argv)
int argc;
char *argv[];
{
    int result;
    yydebug = 1;                 /* enable parse trace output */
    yyin = stdin;
    read_dictionary("dict");
    yyinitialize_heap();
    result = yyparse();
    printf("Result = %d\n", result);
    yyfree_heap();
    return 0;
}
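void yyinit()
{}
int yyerror(message)
char* message;
{
    fprintf(stderr, "%s\n", message);
    exit(1);
}
/* Agreement check for S : NP VP; nonzero if the constraints overlap. */
int check1(sem1, sem2)
SEMPTR sem1, sem2;
{
    return (sem1->seigen & sem2->seigen);
}
/* Agreement check for NP : DET NOUN; nonzero if the constraints overlap. */
int check2(sem1, sem2)
SEMPTR sem1, sem2;
{
    return (sem1->seigen & sem2->seigen);
}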
/* grammar.h */
#define SPELLING_SIZE 32
#define HINSHI_SIZE 32
#define BUFFER_SIZE 64
typedef struct word
{
    struct word *next;
    char *spelling;
    int hinshi; /* parts of speech */
    int seigen; /* constraints */
} WORD;
typedef enum tag
{
    TLEAF, TNODE
} TAG;
typedef struct node
{
    TAG tag;
    union {
        WORD* _leaf;
        struct {
            char *_pos;
            struct node *_left;
            struct node *_right;
        } _pair;
    } contents;
} NODE, *NODEPTR;
#define leaf contents._leaf
#define pos contents._pair._pos
#define left contents._pair._left
#define right contents._pair._right
typedef WORD SEM, *SEMPTR;
#define YYSTYPE NODEPTR
#define YYSEMTYPE SEMPTR
/* dict */ 
I:NOUN:01
You:NOUN:22 
you:NOUN:22 
He:NOUN:04 
he:NOUN:04 
She:NOUN:04 
she:NOUN:04 
It:NOUN:04 
it:NOUN:04 
We:NOUN:10
we:NOUN:10
They:NOUN:40 
they:NOUN:40 
see:VERB:73 
sees:VERB:04 
a:DET:07 
the:DET:77 
with:PREP:00
telescope:NOUN:07 
man:NOUN:07 
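
The constraint values above behave like octal bitmasks when combined by check1 and check2. The C sketch below rests on that reading, which is our assumption and is not stated in the grammar files: one bit per person/number combination, with hypothetical bit names.

```c
#include <assert.h>

/* Assumed reading of the dictionary's third field: an octal bitmask
   with one bit per person/number combination (names are hypothetical). */
enum {
    SG1 = 001, SG2 = 002, SG3 = 004,   /* singular: I, you, he/she/it */
    PL1 = 010, PL2 = 020, PL3 = 040    /* plural: we, you, they */
};

/* Same test as check1/check2 in grammar.y: a nonzero result means the
   two constituents share at least one feature, i.e. they agree. */
int agree(int seigen1, int seigen2)
{
    assert(seigen1 >= 0 && seigen2 >= 0);
    return seigen1 & seigen2;
}
```

Under this reading, "He" (04) agrees with "sees" (04) but not with "see" (73), since 04 & 73 is zero in octal, which matches the rejection of the second sample sentence.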
Sample Runs 
# sentence no.l 
He sees a man with a telescope ^D
# parse 1 
S:(S:(NP:(NOUN:He) 
VP:(VERB:sees 
NP:(DET:a NOUN:man))) 
PP:(PREP:with NP:(DET:a 
NOUN:telescope))) 
# parse 2 
S:(NP:(NOUN:He) 
VP:(VERB:sees 
NP:(NP:(DET:a NOUN:man) 
PP:(PREP:with 
NP:(DET:a NOUN:telescope))))) 
# sentence no.2 
He see a man ^D
# The semantic actions prune syntactically- 
# sound but semantically-incorrect parses. 
