Chart Parsing of Robust Grammars *
Sebastian Goeser
gsr@dhdibml.bitnet
IBM Deutschland GmbH - GADL
Hans-Klemm-Str. 45
D-7030 Böblingen
1 Introduction 
Robustness is a formal behaviour of natural language grammars:
the ability to assign a best partial description to linguistic
events whose strong description is inconsistent or cannot be
constructed. Events of this sort may be called defective with
respect to a grammar fragment. Defectiveness arises from the
performance use that human beings make of language. Since
defectiveness can be seen as a failure of linguistic description,
the principal way to robustness is a method to weaken these
descriptions.
Robust parsing, then, is parsing of robust grammars: a parser is
robust iff it has the capability to interpret weak grammar
fragments correctly. In this paper, I shall try to substantiate
this claim by motivating a grammar dependent approach to robust
parsing and then describing a chart parsing algorithm for robust
grammars. Though only c(ontext) f(ree) grammars will be addressed,
there is an obvious extension of the algorithm to annotated
(unification-) grammars (WACSG formalism, see Goeser 1990) along
the lines of (Shieber 1985).
Grammar based robustness tools have been explored in a variety of
formalisms, e.g. the metarule device within the ATN formalism
(Weischedel and Sondheimer 1983), entity data structures in a case
frame approach (Hayes 1984) or the weak description approach in
unification based grammars (Kudo et al. 1988, Goeser 1990). Parsing
cf grammars with robustness features competes with algorithmic
approaches to robustness, where parsing algorithms (usually chart
parsers, except in Tomabechi and Tomita (1988) where LR(k) parsing
is advocated) are extended to include robustness features (Mellish
1989, Lang 1988) and/or heuristics to handle defect cases (Langer
1990, Stock et al. 1988).

* The work reported has been done while the author received an LGF
grant at the University of Stuttgart.
Perhaps the most critical issue in robust parsing is ambiguity,
which emerges when constituency is loosened to some cf substring
analysis. E.g. Mellish (1989) parses, for a cfg G, the (cf) set
PAR(G), which is the set of all strings containing a sequence of
nonempty substrings that is in the cf language L(G)¹. In the worst
case scenario where all these sequences are in L(G), we get for a
$w \in L(G)$ with an ambiguity $k$ (in G) an exponential ambiguity
of $k \times 2^{|w|}$ as an upper bound. Even in a non-worst case,
which should be the case for realistic cfgs, local ambiguities from
substring analysis massively increase parsing time. E.g. in the
(non-defective) example (1), the arcs a, b, c are empirically valid
while the arcs d, e are artefacts of an algorithm parsing PAR(G).
¹ See Goeser (1990) for a more formal discussion of PAR(G).
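The growth of substring ambiguity can be made concrete with a small Python sketch. It is not Mellish's algorithm: it simply enumerates every nonempty selection of word positions of an input and counts those accepted by a membership test. The function name `substring_analyses`, the membership predicate and the toy input are assumptions made for illustration only.

```python
from itertools import combinations

def substring_analyses(words, in_language):
    """Count the selections of word positions whose words, read in
    order, are accepted by `in_language`.

    Every nonempty selection of positions corresponds to one sequence
    of nonempty substrings of the input.
    """
    n = len(words)
    count = 0
    for size in range(1, n + 1):
        for positions in combinations(range(n), size):
            candidate = [words[p] for p in positions]
            if in_language(candidate):
                count += 1
    return count

# Worst case (hypothetical membership test): every selection is in L(G).
def always_in_language(candidate):
    return True

toy_input = ["w1", "w2", "w3", "w4", "w5", "w6"]
worst_case = substring_analyses(toy_input, always_in_language)
```

For an input of n words there are 2^n - 1 nonempty selections, so in the worst case the count stays just below the 2^{|w|} factor of the upper bound given above; the factor k accounts for the ambiguity of each accepted sequence in G.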
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 / PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
(1) [Figure: chart with vertices 0 to 7 over the words
    "Peter brings ... nice gift to Mary" and arcs a to e]
Reflecting syntactic defectiveness in a cfg means to assign it a
configurational regularity. Obviously, there is syntactic
defectivity which is syntactically nonregular, such as corrupted
output from a speech recognition device (Tomabechi and Tomita
1988)² or global constituent breaks (Goeser 1991), which can be
subjected to syntactic prefix analysis only. On the other hand,
there are spoken language constructions (Lindgren 1987, Goeser
1991, Langer 1990) and various kinds of "fragmentary utterances"
(Carbonell and Hayes 1983) that definitively show configurational
properties.
Let us look at a frequent spoken language construction called
restart, as in the German corpus example (2)³. Restarts follow a
pattern $\langle \alpha\ \beta\ A\ \beta'\ \gamma \rangle$ where the
strings $\alpha$ and $\gamma$, but not $\beta$ and $\beta'$, may be
empty. The restart marker A is optional: in 67 of 96 restart
samples, $\beta$, which mostly ends in a constituent break, and
$\beta'$ were separated phonologically by tone constancy, a short
pause or without any marking at all⁴. Restarts are a kind of
constituent coordination not allowing for ellipsis phenomena such
as gapping, left deletion, split coordination or sluicing. The
$\beta$ substring is usually defective and may indeed contain
arbitrary noise (see e.g. example (3))⁵.

² This material may show phonological regularities, of course.
³ All corpus evidence reported here is psychotherapeutic discourse
from the ULMER TEXTBANK.
⁴ Therefore, Langer's (1990) restart heuristics seems empirically
inadequate insofar as it postulates a syntactic restart marker.
(2) da [is es d ....... dt ein A
    there [is it then still a A
    kommt noch ein anderes Problem hinzu]
    comes yet another problem to-that]
(3) der Peter [hat konnte das dieses deshalb
    the Peter [has could the this therefore
    ehemaligen Lieferwagen
    former truck
    A hat das gekauft]
    A has it bought]
2 Recursive partial string grammars
Recursive partial string grammars (RPSGs) are cfgs with a set of
start symbols and with rules whose left hand side may be indexed
with the keyword SET, SUB, or PAR. The SET index on a rule's LHS
licenses the adjunction of any start symbol to the right or left of
its RHS string. The SUB index licenses arbitrary terminal strings
to the right or left of the indexed symbol's lexical projection.
The PAR index includes SUB and additionally licenses any terminal
strings within this lexical projection. (Left and right sided
indices SETL, SUBL and SETR, SUBR, respectively, are also in use.)
In a derivation relation $\rightarrow$ for RPSGs, an indexed symbol
$A_\pi$ unifies with category $A$ to give $A_\pi$. Formally, SET
adjunction participates in the cf derivation relation, while SUB
and PAR are interpreted by a recursive generation function gen
operating on derivations, where $\omega$ is a derivation, $t$ its
tree structure, $Cat_{ind}$ the set of indexed or non-indexed
non-terminals and $Lex$ the set of terminals. The example
derivation tree (4) shows SET adjunction (dotted lines) and areas
where arbitrary substrings are licensed by an indexed node.
Generally, local arbitrariness within a string may be fully
modelled with an RPSG. Though finite cfls are turned into infinite
ones through RPSG indexing, the syntactic description with RPSG is
still configurational up to certain local adjunctions.

⁵ For a more thorough discussion of restart syntax, see Goeser
(1991).
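The effect of the SUB and PAR indices on a single node can be illustrated with a flat Python sketch. It is a simplification under stated assumptions: the real generation function gen works recursively on derivation trees, whereas the hypothetical helper `licensed` below only compares one node's lexical projection with a terminal string, and SET adjunction is not modelled.

```python
def licensed(index, projection, terminals):
    """Return True if `terminals` is licensed for a node with lexical
    projection `projection` under the given index.

    index: None, "SUB", "SUBL", "SUBR" or "PAR" (SET is not modelled).
    """
    m, n = len(projection), len(terminals)
    if index is None:
        # No index: the terminals must be exactly the projection.
        return terminals == projection
    if index == "PAR":
        # Arbitrary terminals to the left, to the right and in between:
        # the projection must occur as an ordered subsequence.
        remaining = iter(terminals)
        return all(word in remaining for word in projection)
    # SUB variants: the projection must occur as one contiguous block.
    for start in range(n - m + 1):
        if terminals[start:start + m] != projection:
            continue
        extra_left = start > 0
        extra_right = start + m < n
        if index == "SUB":
            return True
        if index == "SUBL" and not extra_right:
            return True
        if index == "SUBR" and not extra_left:
            return True
    return False

projection = ["hat", "das", "gekauft"]
noisy = ["hat", "das", "dieses", "gekauft"]
```

With these toy values, `licensed("PAR", projection, noisy)` holds because the extra word sits inside the projection, while `licensed("SUB", projection, noisy)` does not, since SUB only tolerates material to the left or right.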
3 Basic algorithm 
As a parsing algorithm to start from, Earley's (1970) chart parser
has been chosen, which has a top-down component adaptable to the
top-down percolation of index information, and which guarantees a
worst case complexity of $O(n^3)$ even for maximal ambiguity. We
use the declarative Earley variant in Dörre (1987). For a cfg
$G = \langle Cat, Lex, P, Sset \rangle$, where $Cat$ is a set of
non-terminals, $Lex$ a set of terminals, $P$ a set of rules and
$Sset$ a set of start symbols, it is characterized by the following
predictor concept:
* the predictor is a relation $D(i, A) \subseteq n^+ \times Cat$
  between a vertex $i < n$ and a non-terminal $A$. It is integrated
  into the completer and scanner components (see below). This has
  the advantage that no cyclic items, i.e. items with an empty
  string of parsed symbols, have to be asserted to the chart.
* initialization is the special predictor case $D(0, S)$ where $S$
  is a start symbol.
Let $V = Cat \cup Lex$, $A \rightarrow \alpha\beta \in P$ and
$0 \le i < j \le n$. Let $Chart[i,j]$ be the set of arcs between
vertices $i$ and $j$, and $\rightarrow^*$ the transitive closure of
the derivation relation. Then every item in the chart may be
characterized by the following membership condition⁶, which
respects both top-down (TD) and bottom-up (BU) information. Note
that for the (basic variant of the) Earley algorithm, while item
membership depends on top-down predictor information, the
acceptance of input strings is independent of the predictor
(Kilbury 1985).

$A \rightarrow \alpha . \beta \in Chart[i,j]$ iff
[TD] $\exists S \in Sset: S \rightarrow^* w^{0,i} A \delta$ $\wedge$
[BU] $\alpha \rightarrow^* w^{i,j}$
where $\delta \in V^*$

⁶ See Dörre 1987
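As a point of reference for the basic algorithm, here is a minimal Python sketch of a textbook Earley recognizer with a set of start symbols. It is not the declarative Dörre variant used in the paper: predicted items are entered into the chart explicitly here, whereas that variant folds the predictor into the completer and scanner. The function name, the item encoding `(lhs, rhs, dot, origin)` and the toy grammar are assumptions for illustration.

```python
def earley_recognize(rules, start_symbols, words):
    """Recognize `words` with a cfg.

    rules: dict mapping each non-terminal to a list of RHS tuples.
    Symbols that are not keys of `rules` count as terminals.
    Empty right-hand sides are not supported.
    """
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for s in start_symbols:                      # initialization
        for rhs in rules[s]:
            chart[0].add((s, rhs, 0, 0))
    for j in range(n + 1):
        agenda = list(chart[j])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in rules:                 # predictor
                    new, target = [(nxt, r, 0, j) for r in rules[nxt]], j
                elif j < n and words[j] == nxt:  # scanner
                    new, target = [(lhs, rhs, dot + 1, origin)], j + 1
                else:
                    continue
            else:                                # completer
                new = [(l2, r2, d2 + 1, o2)
                       for (l2, r2, d2, o2) in chart[origin]
                       if d2 < len(r2) and r2[d2] == lhs]
                target = j
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == j:
                        agenda.append(item)
    return any(lhs in start_symbols and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[n])

toy_rules = {
    "S": [("NP", "VP")],
    "NP": [("Peter",), ("Mary",)],
    "VP": [("V", "NP")],
    "V": [("sees",)],
}
accepted = earley_recognize(toy_rules, {"S"}, ["Peter", "sees", "Mary"])
```

In this encoding an item `(lhs, rhs, dot, origin)` stored at chart position j corresponds to a dotted rule in Chart[origin, j]; acceptance only inspects complete start-symbol items spanning the whole input.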
4 The RPSG variant 
4.1 Item Concept 
In the RPSG variant, items are represented as PROLOG facts

item(Number, Lind, Rind, LHS,
     Parsed, To_Parse, RefList)

where item number, the (possibly indexed) left hand symbol, the
list of parsed symbols and the list of symbols yet to parse are
well-known item parts. The variables Lind and Rind represent the
status of substring generation to the left and to the right of the
Parsed string, respectively. Lind ≠ Rind is possible even for the
SUB index, since items represent prefix information on a
constituent, whereas a PAR index always effects Lind = Rind.
Partial string information from higher nodes, which is justified
only within the appropriate derivation, must be distinguished from
SUB or PAR indexing of an item's LHS symbol, which always licenses
arbitrary substrings. To allow reconstruction of a derivation,
RefList records the pairs of items (or pairs of rule and item, see
below) an item is completed from, or it equals lex for lexical
items⁷. To state the chart membership condition of the RPSG
variant, we generalize the function gen to an argument pair of
strings of terminals and possibly indexed non-terminals:
$gen^*: V_{ind}^* \times V_{ind}^* \rightarrow \{0, 1\}$
where
$gen^*(\alpha, \beta) = 1$ iff $\beta$ can be generated from $\alpha$

The RPSG membership condition, then, is:
$A_\pi \rightarrow \alpha . \beta \in Chart[i,j]$ iff
[TD] $\exists S \in Sset_{ind}: gen^*(S, w^{0,i} A_\pi \delta) = 1$ $\wedge$
[BU] $gen^*(\alpha, w^{i,j}) = 1$
where $\alpha, \beta, \delta \in (V_{ind})^*$

(4) [Figure: example derivation tree over the terminal string
    "Peter ... den Peter gefaellt A interessiert die Schule sehr"]

⁷ ... see e.g. Dörre (1987) for a discussion.
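For readers more used to records than to PROLOG facts, the item structure described in this section can be mirrored in a small Python sketch. The field names follow the PROLOG argument names; the concrete value encodings (strings such as "none" and "sub" for Lind and Rind, integer pairs in the reference list, the symbol name "VP_SUB") are hypothetical, since the paper does not spell them out.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Item:
    number: int                 # item number
    lind: str                   # substring generation status, left side
    rind: str                   # substring generation status, right side
    lhs: str                    # possibly indexed left hand symbol
    parsed: Tuple[str, ...]     # symbols already parsed
    to_parse: Tuple[str, ...]   # symbols yet to parse
    # Pairs of item numbers the item was completed from, or "lex"
    # for lexical items.
    ref_list: Union[str, Tuple[Tuple[int, int], ...]] = "lex"

    @property
    def is_passive(self) -> bool:
        """An item is passive when nothing is left to parse."""
        return not self.to_parse

active = Item(number=5, lind="none", rind="sub", lhs="VP_SUB",
              parsed=("V",), to_parse=("NP",), ref_list=((3, 4),))
lexical = Item(number=2, lind="none", rind="none", lhs="N",
               parsed=("gift",), to_parse=())
```

Making the record immutable (`frozen=True`) lets items be stored in sets, which is convenient for the duplicate check a chart needs before asserting a new item.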
4.2 The Predictor 
The predictor of the RPSG variant⁸ is, again, a relation over
vertices and non-terminals. In contrast to the basic variant,
however, a null predictor would be incorrect for the RPSG variant,
since the acceptance of a string now depends on the substring
information percolated by the predictor. The first predictor clause
allows an "initialisation" for every vertex. The second clause
formulates the expectation of a non-terminal $A_\pi$ by an active
item, i.e. an item with a nonempty list To-Parse, and the third the
expectation by passive items with a SET index. Clause 4 expects a
start symbol on the basis of left adjunction to a SET indexed
symbol. The following proposition, a proof of which is available
from the author, states the correctness of this predictor
formalization.

$gen^*(S, w^{0,i} A_\pi \delta) = 1$ iff $D(i, A_\pi)$
for an $S \in Sset_{ind}$
4.3 The Completer
The completer component integrates the predictor relation and the
substring generation function and has two rules for rightside and
leftside adjunction under a SET indexed symbol. Given that the
conditions in the if-clause (and the lookahead condition, see
below) hold, the completer adds new items to the chart⁹. Clause 1
of the RPSG completer is, up to the generation function instead of
derivation, equivalent to the completer of the basic variant: given
a rightside passive item, it adds a new item both for a matching
active item and for the prediction of an appropriate rule's LHS
symbol. Thus, no cyclic items have to be created. Furthermore,
since RPSGs do not have ε-productions, there is no need to handle
cyclic items at all. Clause 2 does rightside adjunction of a start
symbol item to a passive SET indexed item. In left adjunction
according to clause 3, the adjoined (passive) item can again be
licensed either by another (active or passive) SET indexed item or
by the predictor relation.

⁸ See Appendix A for a complete formal characterization of the
RPSG chart parser.
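The rightside adjunction of clause 2 can be pictured with a deliberately simplified Python sketch. It assumes that the adjoined arc starts exactly where the SET indexed arc ends and it ignores the generation function and the lookahead condition, so it should be read as an illustration of the idea rather than as the completer itself. The `Arc` record and the function `adjoin_right` are hypothetical names.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Arc(NamedTuple):
    lhs: str
    index: Optional[str]        # "SET" or None in this sketch
    parsed: Tuple[str, ...]
    to_parse: Tuple[str, ...]
    start: int
    end: int

def adjoin_right(host: Arc, adjunct: Arc,
                 start_symbols: Set[str]) -> Optional[Arc]:
    """Adjoin a passive start-symbol arc to the right of a passive
    SET indexed arc; return the extended arc, or None if not licensed."""
    if host.index != "SET" or host.to_parse or adjunct.to_parse:
        return None             # both arcs must be passive, host SET indexed
    if adjunct.lhs not in start_symbols:
        return None             # only start symbols may be adjoined
    if adjunct.start != host.end:
        return None             # simplifying assumption: strict adjacency
    return host._replace(parsed=host.parsed + (adjunct.lhs,),
                         end=adjunct.end)

host = Arc("S", "SET", ("NP", "VP"), (), 0, 3)
adjunct = Arc("NP", None, ("Det", "N"), (), 3, 5)
extended = adjoin_right(host, adjunct, {"S", "NP"})
```

The extended arc is again passive and SET indexed, so further start symbols could be adjoined to it in the same way.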
4.4 Scanner and Lookahead 
Since the scanner component may be seen as a lexical case of the
completer, the RPSG algorithm could be reduced to a single active
completer component and the controlling relation D (Kilbury 1985).
Note that the scanner allows for RPSG rules with RHS strings of
terminals and non-terminals. A partial lookahead of 1, being
applied to active items only, has proven advantageous in the basic
variant (Dörre 1987). In the RPSG variant, the length of the
lookahead must be conditioned on the fact that zero or more
non-derived but generated words may follow a given vertex. The
lookahead fails if, for the first To-Parse symbol, there is no
first derivable lexical item that is accessible given the actual
substring information.

⁹ The relation F includes an operation which procedurally asserts
new items to the chart.
Unfortunately, the scanner is not independent from this lookahead,
since, in many cases, the item licensed by a lookahead operation
onto a lexical item i is exactly the item licensing i within the
predictor relation. That is, from a procedural viewpoint of
entering items into the chart, the lookahead condition and the
predictor block each other for certain lexical items. In this
situation we decided to have a scanner without a predictor
relation, thus paying for lookahead with an increased local lexical
ambiguity.
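The lookahead condition described in this section can be approximated by the following Python sketch. It assumes a plain cfg without empty right-hand sides and reduces the "actual substring information" to a single boolean flag saying whether generated (skipped) words may precede the next derived word; both function names and the toy grammar are hypothetical.

```python
def first_terminals(rules, symbol, seen=None):
    """Terminals that can begin a string derived from `symbol`.

    rules: dict mapping non-terminals to lists of nonempty RHS tuples;
    symbols that are not keys of `rules` count as terminals.
    """
    if symbol not in rules:
        return {symbol}
    seen = set() if seen is None else seen
    if symbol in seen:          # already expanded (e.g. left recursion)
        return set()
    seen.add(symbol)
    result = set()
    for rhs in rules[symbol]:
        result |= first_terminals(rules, rhs[0], seen)
    return result

def lookahead_ok(rules, symbol, words, j, skipping_licensed):
    """Lookahead of 1 for the first To-Parse symbol at vertex j.

    If skipping is licensed, zero or more words after vertex j may be
    generated rather than derived, so any later word may be the first
    derived one; otherwise only the word at vertex j is inspected.
    """
    first = first_terminals(rules, symbol)
    if skipping_licensed:
        return any(word in first for word in words[j:])
    return j < len(words) and words[j] in first

toy_rules = {
    "VP": [("V", "NP"), ("VP", "PP")],
    "V": [("sees",)],
    "NP": [("Mary",)],
    "PP": [("P", "NP")],
    "P": [("with",)],
}
words = ["Peter", "uh", "sees", "Mary"]
```

With these toy values, `lookahead_ok(toy_rules, "VP", words, 1, False)` fails because "uh" cannot begin a VP, whereas the same call with `skipping_licensed=True` succeeds because "sees" follows after a skipped word.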
5 Status and Conclusion 
The algorithm described has been imple- 
mented and tested as part of the WACSG sys- 
tem that is based on the Stuttgart LFG system 
(Eisele 1987). 
Chart parsing of robust cf grammars is a powerful method to cope
with the configurational aspects of defectiveness. It is part of a
major enterprise to re-analyze robustness not as a parsing problem
but as a problem of weak linguistic description. Therefore, any
formal work on the linguistics of defectiveness can be expected to
improve our methods of robust parsing.
Appendix 
Algorithm: An RPSG Chart Parser 
Input: 
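1. RPSG $G = \langle Cat_{ind}, Lex, P, Sset_{ind} \rangle$
2. string $w = w_1 \ldots w_n$
Output:
"accepted", if $S \rightarrow \alpha . \in Chart[i,j]$ where
$S \in Sset_{ind}$ and $gen^*(\alpha, w^{0,n}) = 1$

condition (predictor):
Let $D(i, A_\pi) \subseteq n^+ \times Cat_{ind}$
$D(i, A_\pi)$ iff
1. $\exists S_\rho \in Sset_{ind}: gen^*(S_\rho, g^{i} A_\pi \delta) = 1$ or
2. $\exists C_\xi \rightarrow \alpha . B_\chi \beta \in Chart[j,k]: k \le i \wedge gen^*(B_\chi, g^{i-k} A_\pi \delta) = 1$ or
3. $\exists C_{SET} \rightarrow \alpha . \in Chart[j,k]: k \le i \wedge \exists D_\xi \in Sset_{ind}: gen^*(D_\xi, g^{i-k} A_\pi \delta) = 1$ or
4. $\exists S_\rho \in Sset_{ind}: gen^*(S_\rho, w^{0,i} C_\xi \delta) = 1 \wedge A_\pi \in Sset_{ind} \wedge \exists C_{SET} \rightarrow \beta \in P$

condition (lookahead):
Let $F \subseteq P^{\circ} \times n^2$.
$F(C_\pi \rightarrow \beta . \beta', i, j)$ iff
1. ($\beta' = \epsilon$, or
   $\beta' = B\gamma$ and $gen^*(B, g^{k-j} w^{k,k+1} \delta) = 1$
   for $B \in Cat_{ind}$, $j \le k < n$) and
2. $C_\pi \rightarrow \beta . \beta' \notin Chart[i,j]$

method:
* scanner: For $0 \le i < j \le n$:
  if $B_\xi \rightarrow w^{i,i+1} w' w^{j-1,j} \in P$ (where $w' \in V_{ind}^{+}$ or $w' = \epsilon$) and
     $gen^*(B_\xi, w^{i,j}) = 1$,
  then $F(B_\xi \rightarrow w^{i,i+1} w' w^{j-1,j} ., i, j)$
* completer: For $0 \le i < j < l \le n$:
  1. if ($A_\pi \rightarrow \alpha . B_\xi \beta \in Chart[i,j]$ or
         ($D(j, A_\pi)$ and $A_\pi \rightarrow B_\xi \beta \in P$ and $\alpha = \epsilon$)) and
        $B_\xi \rightarrow \gamma . \in Chart[k,l]$ and $gen^*(\alpha B_\xi, w^{i,l}) = 1$,
     then $F(A_\pi \rightarrow \alpha B_\xi . \beta, i, l)$
  2. if $B_\xi \rightarrow \gamma . \in Chart[k,l]$ and
        $A_{SET} \rightarrow \alpha . \in Chart[i,j]$ and
        $B \in Sset$ and $gen^*(\alpha B_\xi, w^{i,l}) = 1$,
     then $F(A_{SET} \rightarrow \alpha B_\xi ., i, l)$
  3. if $A_\pi \rightarrow \alpha . \in Chart[i,j]$ and
        ($B_{SET} \rightarrow \beta . \gamma \in Chart[k,l]$ or
         ($D(l, B_{SET})$ and $\beta = \epsilon$ and $B_{SET} \rightarrow \beta\gamma \in P$)) and
        $A_\pi \in Sset$ and $gen^*(A_\pi \beta, w^{i,l}) = 1$,
     then $F(B_{SET} \rightarrow A_\pi \beta . \gamma, i, l)$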

Bibliography 

Carbonell, J. and Hayes, P.: Recovery
Strategies for Parsing Extragrammatical
Language, in: AJCL 9, 3-4, 1983

Dörre, J.: Weiterentwicklung des Earley-Algorithmus
für kontextfreie und ID/LP-Grammatiken,
LiLog-Report 28, IBM Deutschland 1987

Earley, J.: An Efficient Context-free Pars- 
ing Algorithm, in: CACM 13, 2, 1970 

Goeser, S.: A linguistic Theory of Robustness,
in: Proc. of COLING-13, Helsinki 1990

Goeser, S.: Eine linguistische Theorie der
Robustheit, Konstanz 1991

Hayes, P.J.: Entity-Oriented Parsing, in:
COLING-10, Stanford 1984

Kilbury, J.: Chart Parsing and the Earley
algorithm, in: Klenk, U. (ed.): Kontextfreie
Syntaxen und verwandte Systeme,
Max Niemeyer, Tübingen 1985

Kwasny, S.C. and Sondheimer, N.K.:
Relaxation Techniques for Parsing
Grammatically Ill-Formed Input, in:
AJCL 7, 2, 1981

Lang, B.: Parsing Incomplete Sentences,
in: Proc. COLING-12, Budapest 1988

Langer, H.: Parsing Spoken Language, in: 
Proc. COLING-13, Helsinki 1990 

Mellish, C.S.: Some Chart-Based Techniques
for Parsing Ill-Formed Input, in:
Proc. ACL 27, Vancouver 1989

Shieber, S.M.: Using Restriction to Extend
Parsing Algorithms for Complex
Feature Based Formalisms, in: Proc.
ACL 23, 1985

Stock, O., Falcone, R., Insinnamo, P.: Island
Parsing and Bidirectional Charts,
in: Proc. COLING-12, Budapest 1988

Tomabechi, H. and Tomita, M.: The Integration
of Unification-Based Pragmatics
for Real-Time Understanding of
Noisy Continuous Speech Input, in: Proc.
AAAI 7, Saint Paul 1988

ULMER TEXTBANK: A machine-readable
corpus of spoken language
from psychotherapeutic discourse,
University of Ulm

Weischedel, R.M. and Sondheimer, N.K.:
Metarules as a Basis for Processing
Ill-Formed Input, in: AJCL 9, 3-4, 1983
