Parsing Incomplete Sentences 
Bernard LANG 
INRIA 
B.P. 105, 78153 Le Chesnay, France 
lang@inria.inria.fr
Abstract 
An efficient context-free parsing algorithm is presented
that can parse sentences with unknown parts of unknown
length. It produces in finite form all possible parses (often
infinite in number) that could account for the missing
parts. The algorithm is a variation on the construction
due to Earley. However, its presentation is such that it
can readily be adapted to any chart parsing schema
(top-down, bottom-up, etc.).
1 Introduction 
It is often necessary in practical situations to attempt parsing an
incorrect or incomplete input. This may take many forms: e.g.
missing or spurious words, misspelled, misunderstood or otherwise
unknown words \[28\], missing or unidentified word boundaries
\[22,27\]. Specific techniques may be developed to deal with
these situations according to the requirements of the application
area (e.g. natural language processing, programming language
parsing, real-time or off-line processing).
The context-free (CF) parsing of a sentence with unknown
words has been considered by other authors \[28\]. Very simply,
an unknown word may be considered as a "special multi-part-of-speech
word whose part of speech can be anything". This
multi-part-of-speech word need not be introduced in the CF grammar
of the language, but only implicitly in the construction of its
parser. This works very well with Earley-like (chart) parsers
that can simulate all possible parsing paths that could lead to
a correct parse.
In this paper, we deal with the more complex problem of
parsing a sentence for which one or several subparts of unknown
length are missing. Again we can use a chart parser to try all
possible parses on all possible inputs. However the fact that the
length of the missing subsequence is unknown raises an additional
difficulty. Many published chart parsers \[24,28,23,21\] are
constructed with the assumption that the CF grammar of the
language has no cyclic rules. This hypothesis is reasonable for
the syntax of natural (or programming) languages. However the
resulting simplification of the parser construction does not allow
its extension to parsing sentences with unknown subsequences
of words.
If the length (in words) of the missing subsequence were
known, we could simply replace it with as many unknown words,
a problem we know how to handle. When this length is not
known, the algorithm has to simulate the parsing of an arbitrary
number of words, and thus may have to go several times
through reduction by the same rules of the grammar (see footnote 1)
without ever touching the stack present before scanning the unknown
sequence, and without reading the input beyond that sequence.
If we consider the unknown sequence as a special input word,
we are in a situation that is analogous to that created by cyclic
grammars, i.e. grammars where a nonterminal may derive onto
itself without producing any terminal. This explains why techniques
limited to non-cyclic grammars cannot deal with this
problem.

Footnote 1: This grammar oriented view of the computation of the
automaton is only meant as a support for intuition.
It may be noted that the problem is different from that of
parsing in a word lattice \[22,27\] since all possible paths in the
lattice have a known bounded length, even when the lattice
contains separated unknown words. However the technique
presented here combines well with word lattice parsing.
The ability to parse unknown subsequences may be useful
to parse badly transmitted sentences, and sentences that are
interrupted (e.g. in a discussion) or otherwise left unfinished
(e.g. because the rest may be inferred from the context). It
may also be used in programming languages: for example the
programming language SETL \[9\] allows some statements to be
left unfinished in some contexts.
The next section contains an introduction to all-paths parsing.
In section 3 we give a more detailed account of our basic
algorithm and point at the features that allow the handling
of cyclic grammars. Section 4 contains the modifications that
make this algorithm capable of parsing incomplete sentences.
The full algorithm is given in appendix C, while two examples
are given in appendices A and B.
2 All-Paths Parsing 
Since Earley's first paper \[10\], many adaptations or improvements
of his algorithm have been published \[6,5,24,28\]. They
are usually variations following some chart parsing schema \[16\].
In a previous paper \[18\], the author attempted to unify all these
results by proposing an Earley-like construction for all-paths
interpretation of (non-deterministic) Push-Down-Transducers
(PDT). The idea was that left-to-right parsing schemata may
usually be expressed as a construction technique for building a
recognizing Push-Down-Automaton (PDA) from the CF grammar
of the language. This is quite apparent when comparing
the PDA constructions in \[12\] to the chart schemata of \[16\]
which are now a widely accepted reference. Thus a construction
proposed for general PDTs is de facto applicable to most
left-to-right parsing schemata, and allows in particular the use
of well established PDT construction techniques (e.g. precedence,
LL(k), LR(k) \[8,14,2\]) for general CF parsing.
In this earlier paper, our basic algorithm is proved correct,
and its complexity is shown to be $O(n^3)$, i.e. as good as the
best general parsing algorithms (see footnote 2). As is usual with
Earley's construction (see footnote 3), the theoretical complexity
bound is rarely attained, and the algorithm behaves linearly most
of the time. Further optimizations are proposed in \[18\] that
improve this behavior.

Most published variants of Earley's algorithm, including
Earley's own, may be viewed as (a sometimes weaker form of) our
construction applied to some specific PDA or PDT. This is the
explicit strategy of Tomita \[28\] in the special case of LALR(1)
PDT construction technique. A notable exception is the very
general approach of Sheil \[25\], though it is very similar to a
Datalog extension \[19\] of the algorithm presented here.

Footnote 2: Theoretically faster algorithms \[29,7\] can achieve
$O(n^{2.496})$ but with an unacceptable constant factor. Note also
that we do not require the grammar to be in Chomsky Normal Form.

Footnote 3: And unlike tabular algorithms such as
Cocke-Younger-Kasami's \[13,15,30,11\].
An essential feature of all-paths parsing algorithms is to be
able to produce all possible parses in a concise form, with as
much sharing as possible of the common subparses. This is
realized in many systems \[6,24,28\] by producing some kind of
shared-forest which is a representation of all parse-trees with
various sharings of common subparts. In the case of our algorithm,
a parse is represented by the sequence of rules to be
used in a left-to-right reduction of the input sentence to the
initial nonterminal of the grammar. Sharing between all possible
parses is achieved by producing, instead of an extensionally
given set of possible parse sequences, a new CF grammar
that generates all possible parse sequences (possibly an infinite
number if the grammar of the input language is cyclic, and if
the parsed sentence is infinitely ambiguous). With appropriate
care, it is also possible to read this output grammar as a
shared-forest (see appendix A). However its meaningful interpretation
as a shared-forest is dependent on the parsing schema
(cf. \[12,16\]) used in constructing the PDT that produces it as
output. Good definition and understanding of shared forests
is essential to properly define and handle the extra processing
needed to disambiguate a sentence, in the usual case when the
ambiguous CF grammar is used only as a parsing backbone
\[24,26\]. The structure of shared forests is discussed in \[4\].
Before and while following the next section, we suggest that
the reader look at appendix A, which contains a detailed example
showing an output grammar and the corresponding shared
forest for a slightly ambiguous input sentence.
3 The Basic Algorithm 
A formal definition of the extended algorithm for possibly in- 
complete sentences is given in appendix C. The formal aspect 
of our presentation of the algorithm is justified by the fact that 
it allows specialization of the given constructions to specific
parsing schemata without loss of the correctness and complexity
properties, as well as the specialization of the optimization
techniques (see \[18\]) established in the general case. The exam- 
ples presented later were obtained with an adaptation of this 
general algorithm to bottom-up LALR(1) parsers \[8\]. 
Our aim is to parse sentences in the language $\mathcal{L}(G)$
generated by a CF phrase structure grammar
$G = (V, \Sigma, \Pi, \bar{N})$ according to its syntax. The notation
used is $V$ for the set of nonterminals, $\Sigma$ for the set of
terminals, $\Pi$ for the rules, and $\bar{N}$ for the initial
nonterminal.

We assume that, by some appropriate parser construction
technique (e.g. \[14,8,2,1\]) we mechanically produce from the
grammar $G$ a parser for the language $\mathcal{L}(G)$ in the form of a
(possibly non-deterministic) push-down transducer (PDT)
$\mathcal{T}_G$. The output of each possible computation of the parser
is a sequence of rules in $\Pi$ (see footnote 4) to be used in a
left-to-right reduction of the input sentence (this is obviously
equivalent to producing a parse-tree).

We assume for the PDT $\mathcal{T}_G$ a very general formal definition
that can fit most usual PDT construction techniques. It
is defined as an 8-tuple
$\mathcal{T}_G = (Q, \Sigma, \Delta, \Pi, \delta, \mathring{q}, \mathring{\$}, F)$
where: $Q$ is the set of states, $\Sigma$ is the set of input word
symbols, $\Delta$ is the set of stack symbols, $\Pi$ is the set of
output symbols (i.e. rules of $G$), $\mathring{q}$ is the initial state,
$\mathring{\$}$ is the initial stack symbol, $F$ is the set of final
states, $\delta$ is a finite set of transitions of the form:
$(p\,A\,a \mapsto q\,B\,u)$ with $p, q \in Q$,
$A, B \in \Delta \cup \{\varepsilon_\Delta\}$,
$a \in \Sigma \cup \{\varepsilon_\Sigma\}$, and $u \in \Pi^*$.

Footnote 4: Implementations usually denote these rules by their
index in the set $\Pi$.
Let the PDT be in a configuration $\rho = (p\ A\alpha\ ax\ u)$ where $p$
is the current state, $A\alpha$ is the stack content with $A$ on the
top, $ax$ is the remaining input where the symbol $a$ is the next to be
shifted and $x \in \Sigma^*$, and $u$ is the already produced output.
The application of a transition $\tau = (p\,A\,a \mapsto q\,B\,v)$
results in a new configuration $\rho' = (q\ B\alpha\ x\ uv)$ where the
terminal symbol $a$ has been scanned (i.e. shifted), $A$ has been
popped and $B$ has been pushed, and $v$ has been concatenated to the
existing output $u$. If the terminal symbol $a$ is replaced by
$\varepsilon_\Sigma$ in the transition, no input symbol is scanned.
If $A$ (resp. $B$) is replaced by $\varepsilon_\Delta$ then
no stack symbol is popped from (resp. pushed on) the stack.
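
As an illustration only, the effect of a single transition on a configuration can be sketched in Python. The names `Transition`, `Config`, `apply` and the use of `None` for the empty symbol are assumptions made for this sketch; they do not come from the construction itself, which works on items rather than on explicit configurations.

```python
from typing import NamedTuple, Optional, Tuple

EPS = None  # hypothetical encoding of the empty symbol (epsilon)

class Transition(NamedTuple):
    p: str                  # source state
    pop: Optional[str]      # stack symbol A to pop, or EPS
    scan: Optional[str]     # input symbol a to scan, or EPS
    q: str                  # target state
    push: Optional[str]     # stack symbol B to push, or EPS
    out: Tuple[str, ...]    # output v appended to the produced output

class Config(NamedTuple):
    state: str
    stack: Tuple[str, ...]  # stack content, top first
    rest: Tuple[str, ...]   # remaining input
    output: Tuple[str, ...]

def apply(t: Transition, c: Config) -> Optional[Config]:
    """Return the successor configuration, or None if t is not applicable."""
    if c.state != t.p:
        return None
    stack, rest = c.stack, c.rest
    if t.pop is not EPS:
        if not stack or stack[0] != t.pop:
            return None
        stack = stack[1:]           # A is popped
    if t.scan is not EPS:
        if not rest or rest[0] != t.scan:
            return None
        rest = rest[1:]             # a is scanned (shifted)
    if t.push is not EPS:
        stack = (t.push,) + stack   # B is pushed
    return Config(t.q, stack, rest, c.output + t.out)
```

A non-deterministic PDT may have several applicable transitions in one configuration; the dynamic programming construction described next explores all of them without enumerating configurations one by one.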
Our algorithm consists in an Earley-like (see footnote 5) simulation
of the PDT $\mathcal{T}_G$. Using the terminology of \[2\], the
algorithm builds an item set $S_i$ successively for each word symbol
$x_i$ holding position $i$ in the input sentence $x$. An item is
constituted of two modes of the form $(p\,A\,i)$ where $p$ is a PDT
state, $A$ is a stack symbol, and $i$ is the index of an input symbol.
The item set $S_i$ contains items of the form
$((p\,A\,i)\,(q\,B\,j))$. These items are used as nonterminals of a
grammar $\mathcal{G} = (S, \Pi, P, U_f)$, where $S$ is the set of all
items (i.e. the union of the $S_i$), and the rules in $P$
are constructed together with their left-hand-side item by the
algorithm. The initial nonterminal $U_f$ of $\mathcal{G}$ derives on
the last items produced by a successful computation.
The meaning of an item $U = ((p\,A\,i)\,(q\,B\,j))$ is the following:
• there are computations of the PDT on the given input
sentence that reach a configuration $\rho'$ where the state is
$p$, the stack top is $A$ and the last symbol scanned is $x_i$;
• the next stack symbol is then $B$ and, for all these computations,
it was last on top in a configuration $\rho$ where the
state was $q$ and the last symbol scanned was $x_j$;
• the rule sequences in $\Pi^*$ derivable from $U$ in the grammar
$\mathcal{G}$ are exactly those sequences output by the above defined
computations of the PDT between configurations $\rho$ and $\rho'$.
In simpler words, an item may be understood as a set of
distinguished fragments of the possible PDT computations, that
are independent of the initial content of the stack, except for its
top element. Item structures are used to share these fragments
between all PDT computations that can use them, so as to
avoid duplication of work. In the output grammar an item is
a nonterminal that may derive on the outputs produced by the
corresponding computation fragments.
The items may also be read as an encoding of the possible
configurations that could be attained by the PDT on the given
input, with sharing of common stack fragments (the same fragment
may be reused several times for the same stack in the case
of cyclic grammars, or incomplete sentences). In figure 1 we
represent a partial collection of items. Each item is represented
by its two modes as $(K_h\,K_{h'})$ without giving the internal
structure of modes as triples (PDT-state × stack-symbol ×
input-index). Each mode $K_h$ actually stands for the triple
$(p_h\,A_h\,i_h)$. We have added arrows from the second component of
every item $(K_h\,K_{h'})$ to the first component of any item
$(K_{h'}\,K_{h''})$. This chaining indicates in reverse the order in
which the corresponding modes are encountered during a possible
computation of the PDT. In particular, the sequence of stack symbols
of the first modes of the items in any such chain is a possible stack
content. Ignoring the output, an item $(K_h\,K_{h'})$ represents the
set of PDT configurations where the current state is $p_h$, the next
input symbol to be read has the index $i_h + 1$, and the stack content
is formed of all the stack symbols to be found in the first
mode of all items of any chain of items beginning with
$(K_h\,K_{h'})$. Hence, if the collection of items of figure 1 is
produced by a dynamic programming computation, it means that a standard
non-deterministic computation of the PDT could have reached
state $p_1$, having last read the input symbol of index $i_1$, and
having built any of the following stack configurations (among
others), with the stack top on the left hand side: $A_1A_2A_3\ldots$,
$A_1A_2A_3A_7\ldots$, $A_1A_2A_3A_5A_6\ldots$, $A_1A_2A_3A_5A_8\ldots$,
$A_1A_2A_4A_3A_5A_8\ldots$, $A_1A_2A_4A_5A_8\ldots$, and so on.

Figure 1: Items as shared representations of stack configurations

Footnote 5: We assume the reader to be familiar with some variation
of Earley's algorithm. Earley's original paper uses the word state
instead of item.
The transitions of the PDT are interpreted to produce new
items, and new associated rules in $P$ for the output grammar
$\mathcal{G}$, as described in appendix C. When the same item is
produced several times, only one copy is kept in the item set, but a
new rule is produced each time. This merging of identical items
accounts for the sharing of identical subcomputations. The
corresponding rules with the same left-hand side (i.e. the multiply
produced item) account for some of the sharing in the output (cf.
appendices A & B). Sharing in the output also appears in the
use of the same item in the right-hand side of several different
output rules. This directly results from the non-determinism of
the PDT computation, i.e. the ambiguity of the input sentence.
The critical feature of the algorithm for handling cyclic rules
(i.e. infinite ambiguity) is to be found in the handling of popping
transitions (see footnote 6). When applying a popping transition
$\tau = (p\,A\,\varepsilon_\Sigma \mapsto r\,\varepsilon_\Delta\,z)$
to the item $U = ((p\,A\,i)\,(q\,B\,j))$ the algorithm must find all
items $Y = ((q\,B\,j)\,(s\,D\,k))$ that have been produced, i.e. all
items with first mode $(q\,B\,j)$, and build for each of them
a new item $V = ((r\,B\,i)\,(s\,D\,k))$ together with the output rule
$(V \to Y\,U\,z)$ to be added to $P$. The subtle point is that the
$Y$-items must be all items with $(q\,B\,j)$ as first mode, including
those that, when $j = i$, may be built later in the computation
(e.g. because their existence depends on some other $V$-item
built in that step).
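
The subtle point about late $Y$-items can be illustrated with a deliberately naive Python sketch. It closes a set of items under popping transitions by iterating to a fixpoint, so that a $Y$-item created late is still combined with every waiting $U$-item. The function name `close_under_pops`, the tuple encoding of items and the dictionary of popping transitions are assumptions of this sketch; it ignores the other kinds of transitions and makes no attempt at the indexing needed for the cubic bound.

```python
def close_under_pops(items, pop_transitions):
    """Close a set of items under popping transitions (naive fixpoint).

    items: set of items ((p, A, i), (q, B, j)), each mode being a triple.
    pop_transitions: hypothetical table {(p, A): [(r, z), ...]} standing for
        the transitions (p A eps -> r eps z).
    Returns (items, rules); a rule is (V, (Y, U, z)), read as V -> Y U z.
    """
    items = set(items)
    rules = set()
    changed = True
    while changed:                      # repeat until nothing new appears
        changed = False
        for U in list(items):
            (p, A, i), second = U
            for r, z in pop_transitions.get((p, A), []):
                for Y in list(items):
                    if Y[0] != second:  # Y must have (q, B, j) as first mode
                        continue
                    V = ((r, second[1], i), Y[1])
                    rule = (V, (Y, U, z))
                    if V not in items or rule not in rules:
                        items.add(V)
                        rules.add(rule)
                        changed = True
    return items, rules
```

Because the loop restarts whenever an item or rule is added, an item $Y$ that only comes into existence through another pop in the same item set is eventually paired with $U$, which is the behaviour required above when $j = i$.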
4 Parsing Incomplete Sentences 
In order to handle incomplete sentences, we extend the input
vocabulary with 2 symbols: "?" standing for one unknown word
symbol, and "*" standing for an unknown sequence of input
word symbols (see footnote 7).
Normally a scanning transition, say
$(p\,\varepsilon\,a \mapsto r\,\varepsilon\,z)$, is applicable to an
item, say $U = ((p\,A\,i)\,(q\,B\,j))$ in $S_i$, only when
$a = x_{i+1}$, where $x_{i+1}$ is the next input symbol to be shifted.
It produces a new item in $S_{i+1}$ and a new rule in $P$, respectively
$V = ((r\,A\,i{+}1)\,(q\,B\,j))$ and $(V \to U\,z)$ for the above
transition and item.

When the next input symbol to be shifted is $x_{i+1} = {?}$ (i.e. the
unknown input word symbol), then any scanning transition may
be applied as above independently of the input symbol required
by the transition (provided that the transition is applicable with
respect to PDT state and stack symbol).

Footnote 6: Popping transitions are also the critical place to look
at for ensuring $O(n^3)$ worst case complexity.

Footnote 7: Several adjacent "*" are equivalent to a single one.
When the next input symbol to be shifted is $x_{i+1} = *$ (i.e. the
unknown input subsequence), then the algorithm proceeds as
for the unknown word, except that the new item $V$ is created in
item set $S_i$ instead of $S_{i+1}$, i.e. $V = ((r\,A\,i)\,(q\,B\,j))$
in the case of the above example. Thus, in the presence of the unknown
symbol subsequence *, scanning transitions may be applied any
number of times to the same computation thread, without shifting
the input stream (see footnote 8).

Scanning transitions are also used normally on input symbol
$x_{i+2}$ so as to produce also items in $S_{i+2}$, for example the
item $((r\,A\,i{+}2)\,(q\,B\,j))$, assuming $a = x_{i+2}$ in the case
of the above example (see footnote 9). This is how computation
proceeds beyond the unknown subsequence.
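
The three scanning regimes (normal word, "?", and "*") can be summarized by computing, for an item of $S_i$ and a scanning transition that requires the terminal `a`, the indices of the item sets that may receive the new item. The helper `scan_targets` below is a hypothetical name introduced for this sketch, and the input is a 0-based Python list, so that `x[i]` plays the role of $x_{i+1}$. The case of a "*" ending the input is outside the simplifying assumption of footnote 9; the sketch then simply returns the current index.

```python
def scan_targets(x, i, a):
    """Item-set indices that may receive the result of scanning `a` from S_i.

    x: list of input symbols, where "?" is one unknown word and "*" an
       unknown subsequence; x[i] is the symbol written x_{i+1} in the text.
    """
    n = len(x)
    if i >= n:
        return set()                  # nothing left to scan
    nxt = x[i]
    if nxt == "?":
        return {i + 1}                # any terminal matches the unknown word
    if nxt != "*":
        return {i + 1} if nxt == a else set()
    targets = {i}                     # dummy scan: stay in the same item set
    h = i + 1                         # 1-based index of the symbol under test
    while h <= n and x[h - 1] == "*":
        h += 1                        # adjacent "*" count as a single one
    if h <= n and x[h - 1] in (a, "?"):
        targets.add(h)                # real scan of the symbol following "*"
    return targets
```

With the input of appendix B, `scan_targets(["?", "v", "*", "n"], 2, "n")` yields both the current index 2 and the index 4, which corresponds to creating items both in $S_i$ and in the item set of the symbol that follows the unknown subsequence.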
There is a remaining difficulty due to the fact that it may be
hard to relate a parse sequence of rules in $\Pi$ to the input
sentence because of the unknown number of input symbols actually
assumed for an occurrence of the unknown input subsequence.
We solve this difficulty by including the input word symbols in
their proper place in parse sequences, which can thus be read
as postfix Polish encodings of the parse tree. In such a parse
sequence, the symbol * is included a number of times equal to
the assumed length of the corresponding unknown input
subsequence(s) for that parse (cf. appendix B).
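
Reading a parse sequence as a postfix Polish encoding amounts to a simple stack evaluation, sketched below in Python for a complete sequence of appendix A (without its $ delimiters). The table `RULES` maps rule numbers to a left-hand side and an arity. The entries for rules 4 to 7 follow the grammar listed in appendix A; the entries for rules 1 to 3 are assumptions inferred from the parse sequences shown there, and the names `RULES` and `decode` are hypothetical.

```python
# rule number -> (left-hand side, number of right-hand-side symbols)
RULES = {
    1: ("S", 2),   # assumed: S  ::= NP VP
    2: ("S", 2),   # assumed: S  ::= S PP
    3: ("NP", 1),  # assumed: NP ::= n
    4: ("NP", 2),  # NP ::= det n
    5: ("NP", 2),  # NP ::= NP PP
    6: ("PP", 2),  # PP ::= prep NP
    7: ("VP", 2),  # VP ::= v NP
}

def decode(sequence):
    """Rebuild parse trees from a postfix sequence of words and rule numbers."""
    stack = []
    for tok in sequence.split():
        if tok.isdigit():
            lhs, arity = RULES[int(tok)]
            children = stack[-arity:]    # the last `arity` subtrees
            del stack[-arity:]
            stack.append((lhs, children))
        else:
            stack.append(tok)            # an input word symbol is a leaf
    return stack

trees = decode("n 3 v det n 4 7 1 prep det n 4 6 2")
```

For a sequence in which some * stand for whole nonterminals (as in appendix B after simplification), each such * is simply a leaf of the decoded tree.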
A last point concerns simplification of the resulting grammar
$\mathcal{G}$, or equivalently of the corresponding shared-parse-forest.
In practice an unknown subsequence may stand for an arbitrarily
complex sequence of input word symbols, with a correspondingly
complex parse structure. Since the subsequence
is unknown anyway, its hypothetical structures can be summarized
by the nonterminal symbols that dominate it (thanks to
context-freeness).
Hence the output parse grammar $\mathcal{G}$ produced by our
algorithm may be simplified by replacing with the unknown subsequence
terminal *, all nonterminals (i.e. items) that derive only
on (occurrences of) this symbol. However, to keep the output
readable, we usually qualify these * symbols with the appropriate
nonterminal of the parsed language grammar $G$. The
substructures thus eliminated can be retrieved by arbitrary use
of the original CF grammar of the parsed language, which thus
complements the simplified output grammar (see footnote 10). An
example is given in appendix B.
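
One possible way to find the nonterminals that can be replaced by * is a greatest fixpoint computation, sketched below in Python. This is not taken from the algorithm of appendix C: the function name `star_only_nonterminals` and the grammar encoding are hypothetical, rule numbers are represented as integers and are ignored when deciding whether a nonterminal derives only on *, and the grammar is assumed to have no useless nonterminals.

```python
def star_only_nonterminals(grammar, star="*"):
    """Nonterminals whose derivations contain no input symbol other than `star`.

    grammar: {nonterminal: [right-hand side tuples]}; any symbol that is not
    a key of `grammar` is a terminal, and integers are rule numbers.
    """
    keep = set(grammar)                 # optimistic start: all nonterminals
    changed = True
    while changed:
        changed = False
        for nt in list(keep):
            for rhs in grammar[nt]:
                ok = all(s == star or isinstance(s, int) or s in keep
                         for s in rhs)
                if not ok:              # some rule mentions a known word
                    keep.discard(nt)
                    changed = True
                    break
    return keep

example = {
    "X": [("*", 3), ("X", "*", 5)],     # only * and rule numbers
    "Y": [("X", "n", 4)],               # mentions the known word n
    "Z": [("X",), ("Y",)],              # may reach n through Y
}
```

Here only `X` qualifies; `Z` does not, because one of its alternatives leads to a known input word.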
5 Conclusion 
We have shown that Earley's construction, when correctly
accepting cyclic grammars, may be used to parse incomplete
sentences. The generality of the construction presented allows its
adaptation to any of the classical parsing schemata \[16\], and
the use of well established parser construction techniques to
achieve efficiency. The formal setting we have chosen is to our
knowledge the only one that has ever been used to prove the
correctness of the constructed parse forest as well as that of the
recognizer itself. We believe it to be a good framework to study
the structure of parse forests \[4\], and to develop optimization
strategies.

Footnote 8: Note that in such a situation, a rule $X \to aX$ of the
language grammar $G$ behaves as if it were a cyclic rule $X \to X$,
since the parsing proceeds as if it were ignoring terminal symbols.
This does not lead to an infinite computation since only a finite
number (proportional to $i$) of distinct items can be built in $S_i$.

Footnote 9: We assume, only for simplicity of exposition, that * is
followed by a normal input word symbol. Note also that $S_{i+1}$ is
not built.

Footnote 10: If the input were reduced to the unknown subsequence
alone, the output grammar $\mathcal{G}$ would be equivalent to the
original grammar $G$ of the input language (up to simple
transformation). The output parse sequences would then simplify into
a single occurrence of the symbol * qualified by the initial
nonterminal $\bar{N}$ of the language grammar $G$.
Recent extensions of our approach to recursive queries in
Datalog \[19\] and to Horn clauses \[20\] are an indication that
these techniques may be applied effectively to more complex
grammatical settings, including unification based grammars and
logic based semantics processing. More generally, dynamic
programming approaches such as the one presented here should
be a privileged way of dealing with ill-formed input, since the
variety of possible errors is the source of even more combinatorial
problems than the natural ambiguity or non-determinism
already present in many "correct" sentences.
Acknowledgements: Sylvie Billot is currently studying 
the implementation technology for the algorithms described here 
\[3,4\]. The examples in appendices A & B were produced with 
her prototype implementation. The author gratefully acknowl- 
edges her commitment to have this implementation running 
in time, as well as numerous discussions with her, Véronique
Donzeau-Gouge, and Anne-Marie Vercoustre. 
A Simple example without unknown
input subsequence
This first simple example, without unknown input, is intended
to familiarize the reader with our constructions.
A.1 Grammar of the analyzed language
This grammar is taken from \[28\].
Nonterminals are in capital letters, and terminals are in
lower case. The first rule is used for initialization and handling
of the delimiter symbol $. The $ delimiters are implicit
in the actual input sentence.
(4) NP ::= det n
(5) NP ::= NP PP
(6) PP ::= prep NP
(7) VP ::= v NP
A.2 Input sentence
This input corresponds (for example) to the sentence:
"I saw a man with a mirror"
ANALYSIS OF: (n v det n prep det n)
A.3 Output grammar produced by the parser
The grammar output by the parser is given in figure 2. The
initial nonterminal is the left-hand side of the first rule. For
readability, the nonterminals/items have been given computer
generated names, of the form ntx, where x is an integer. At this
point we have forgotten the internal structure of the items
corresponding to their role in the parsing process. All other symbols
are terminal. Integer terminals correspond to rule numbers of
the input language grammar $G$ (see section A.1 above), and the
other terminals are symbols of the parsed language, i.e. symbols
in $\Sigma$. Note the ambiguity for nonterminal nt3.
nt0 ::= ntl nt2 ntl4 ::= det 
ntl ::= $ ntl5 ::= n 
at2 ::= at3 nt28 ntl6 ::= ntl7 6 
nt3 ::= nt4 2 ntl7 ::= ntl8 ntl9 
nt3 ::= nt23 1 ntl8 ::=prep 
nt4 ::= nt5 ntl6 ntl9 ::= nt20 4 
nt5 ::= nt6 1 nt20 ::= nt21 nt22 
at6 ::= at7 nt9 nt21 ::= dot 
nt7 ::= at8 3 nt22 ::= n 
at8 ::= n at23 ::= nt7 nt24 
at9 ::= ntl0 7 nt24 ::= nt25 7 
ntl0 ::= ntll nt12 at25 ::= ntll nt26 
ntll :::= v nt26 ::= nt27 5 
ntl2 ::= ntl3 4 nt27 ::= ntl2 ntl6 
ntl3 ::= ntl4 ntl5 nt28 ::= $ 
Figm'e 2: The output grammar. 
Figure 3: Graph of the output grammar.
Figure 4: The shared parse forest
A.4 Simplified output grammar
This is a simplified form of the grammar in which some of the
structure that makes it readable as a shared-forest has been lost
(though it could be retrieved). However it preserves all sharing
of common subparses. This is the justification for having so
many rules, while only 2 parse sequences may be generated by
that grammar.
nt0  ::= $ nt3 $
nt3  ::= nt7 nt11 nt12 7 1 nt16 2
nt3  ::= nt7 nt11 nt12 nt16 5 7 1
nt7  ::= n 3
nt11 ::= v
nt12 ::= det n 4
nt16 ::= prep det n 4 6
The 2 parses of the input, which are defined by this grammar,
are:
$ n 3 v det n 4 7 1 prep det n 4 6 2 $ 
$ n 3 v det n 4 prep det n 4 6 5 7 1 $ 
Here again the 2 symbols $ must be read as delimiters. 
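
As a check on this reading of the simplified grammar, the following Python sketch enumerates every terminal sequence it generates. The grammar is transcribed from section A.4, taking both long rules as alternatives for nt3; the names `GRAMMAR` and `sequences` are hypothetical. The recursion assumes an acyclic grammar, which holds here but would not hold for the output grammar of an incomplete sentence.

```python
GRAMMAR = {
    "nt0":  [["$", "nt3", "$"]],
    "nt3":  [["nt7", "nt11", "nt12", "7", "1", "nt16", "2"],
             ["nt7", "nt11", "nt12", "nt16", "5", "7", "1"]],
    "nt7":  [["n", "3"]],
    "nt11": [["v"]],
    "nt12": [["det", "n", "4"]],
    "nt16": [["prep", "det", "n", "4", "6"]],
}

def sequences(symbol):
    """Yield every terminal sequence derivable from `symbol` (acyclic grammar)."""
    if symbol not in GRAMMAR:
        yield [symbol]                  # a terminal derives itself
        return
    for rhs in GRAMMAR[symbol]:
        partial = [[]]
        for s in rhs:                   # concatenate the choices for each symbol
            partial = [left + right
                       for left in partial
                       for right in sequences(s)]
        yield from partial

parses = [" ".join(seq) for seq in sequences("nt0")]
```

The list `parses` then contains exactly the two parse sequences displayed above, in the same order.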
A.5 Parse forest built from that grammar 
To explain the construction of the shared forest, we first build
in figure 3 a graph from the grammar of section A.3. Here the
graph is acyclic, but with an incomplete input, it could have
cycles. Each node corresponds to one terminal or nonterminal
of the grammar in section A.3, and is labeled by it. The labels
at the right of small dashes are input grammar rule numbers
(cf. section A.1). Note the ambiguity of node nt3 represented
by an ellipse joining the two possible parses.
From the graph of figure 3, we can trivially derive the
shared-forest given in figure 4.
For readability, we present this shared-forest in a simplified
form. Actually the sons of a node need sometimes to be represented
as a binary Lisp-like list, so as to allow proper sharing
of some of the sons. Each node includes a label which is a
nonterminal of the grammar $G$, and for each possible derivation
(several in case of ambiguity, e.g. the top node of figure 4) there
is the number of the grammar rule used for that derivation.
The constructions in this section are purely virtual, and
are not actually necessary in an implementation. The
data-structure representing the grammar of section A.3 may be
directly interpreted and used as a shared-forest.
B Example with an unknown input 
subsequence 
B.1 Grammar of the analyzed language 
The grammar is the same as in appendix A. 
B.2 Input sentence
This input corresponds (for example) to the sentence:
"... saw ... mirror"
where the first "..." is known to be one word, and the last
"..." may be any number of words, i.e.:
ANALYSIS OF: (? v * n)
B.3 Output grammar produced by the parser
Note that the nodes that derive on (several) symbol(s) * have
been replaced by * for simplification as indicated at the end of
section 4. This explains the gaps in the numbering of nonterminals.
Figure 5: Shared-forest for an incomplete sentence.
Figure 6: A parse tree extracted from the forest.
nt0  ::= nt1 nt2        nt26 ::= * nt27
nt1  ::= $              nt27 ::= nt28
nt2  ::= nt3 nt38       nt27 ::= nt32 5
nt3  ::= nt4 2          nt28 ::= nt29 4
nt3  ::= nt33 1         nt28 ::= nt31 3
nt4  ::= nt5 nt25       nt29 ::= * nt30
nt5  ::= nt6 2          nt30 ::= n
nt5  ::= nt17 1         nt31 ::= n
nt6  ::= nt5 *          nt32 ::= * nt25
nt17 ::= nt18 nt20      nt33 ::= nt18 nt34
nt18 ::= nt19 3         nt34 ::= nt35 7
nt19 ::= ?              nt35 ::= nt22 nt36
nt20 ::= nt21 7         nt36 ::= nt28
nt21 ::= nt22 *         nt36 ::= nt37 5
nt22 ::= v              nt37 ::= * nt25
nt25 ::= nt26 6         nt38 ::= $
B.4 Simplified output grammar 
nt0  ::= $ nt3 $                nt28 ::= * n 4
nt3  ::= nt5 nt25 2             nt28 ::= n 3
nt3  ::= nt18 nt22 nt36 7 1     nt25 ::= * nt27 6
nt5  ::= nt5 * 2                nt18 ::= ? 3
nt5  ::= nt18 nt22 * 7 1        nt22 ::= v
nt27 ::= nt28                   nt36 ::= nt28
nt27 ::= * nt25 5               nt36 ::= * nt25 5
A parse of the input, chosen in the infinite set of possible
parses defined by this grammar, is the following (see figure 6):
$ ? 3 v * 7 1 * 2 * * * * n 4 6 5 6 2 $
This is not really a complete parse since, due to the first
simplification of the grammar, some * symbols stand for a missing
nonterminal, i.e. for any parse of a string derived from this
nonterminal. For example the first * stands for the nonterminal
NP and could be replaced by "* 3" or by "* * 4 * * 3 6 5".
B.5 Parse shared-forest built from that grammar
The output grammars given above are not optimal with respect
to sharing. Mainly the nonterminals nt27 and nt36 should be
the same (they do generate the same parse fragments). Also
the terminal n should appear only once. We give in figure 5
a shared-forest corresponding to this grammar, built as in the
previous example of appendix A, where we have improved the
sharing by merging nt27 and nt36 so as to improve readability.
We do not give the intermediate graph representing the output
grammar as we did in appendix A.
Our implementation is currently being improved to directly
achieve better sharing.
In figure 6 we give one parse-tree extracted from the shared-forest
of figure 5. It corresponds to the parse sequence given as
example in section B.4 above. Note that, like the corresponding
parse sequence, this is not a complete parse tree, since it has
nonterminals labeling its leaves. A complete parse tree may be
obtained by completing arbitrarily these leaves according to the
original grammar of the language as defined in section A.1.
C The algorithm 
The length of this algorithm is due to its generality. Fewer types
of transitions are usually needed with specific implementations,
typically only one for scanning transitions.
Comments are prefixed with "--".
.... Begin parse with input sequence x of length n 
~e~A: -- Initialization 
:=: o o o o ((q$ O) (q$ 0)); 
:::: (0 ~ e); 
So ::~ {6}; 
"p :=: {~}; 
i :=: O; 
step-B: -- Iteration 
-- initial item 
-- first rule of output grammar 
-- initialize item-set ,.go 
-- rules of output grammar 
--- input-scanner index is set 
-- before the first input symbol 
loop -- while i < n (el, exit in step-B.$) 
if xi+t # * 
~tepoB.l: -- Normal completion of item-set St 
--- with non-scanning transitions. 
:l:or nve.vy item U = ((pAi)(ql~j)) in 8/ do 
~:or avery noa-scanuing transltion r in $ do 
we distinguish five cases, according to r: 
~: -~ stack-flee transition 
if r=(pee ~-~ rez) 
then V := ((rAi)(qBj)); 
& := &u{V}; 
v := v u {(v --, uz)}; 
~: -- push transition 
if r=(pee ~-) rcz) 
then V := ((rCi)(pAi)); 
s, := &u{v}; 
v := v u ((v --, z)}; 
case-B.l,3: -- pop transition 
if r=(pAe ~-+ rex) 
then 
~or every item Y = ((q B j) (s D k)) in Sj 
do V := ((r n i) (s D k)) ; 
& := &u{v); 
V := PU {(V --+ YCz)}; 
case-B.l.~: -- pop-push transition 
if r=(pAe ~ rCz) 
then V := ((rCi)(qBj)); 
& := &u{V}; 
v := v u {(v -~ Uz)}; 
case-B.1.~: 
-- Other non-scanning transitions are ignored 
else --~. 
--- t.e. the next input symbol 
- is the unknown subsequenee: 
step-B*.h -- Completion of item-set Si 
-- with non-scanning transitions 
-- and with dummy scanning transitions. 
--- This step is similar to step-B. 1, 
-- but considering all transitions as non-scanning. 
for every item U=((pAi)(qBj)) in Si do 
for every transition v in 6 do 
-- we distinguish five eases, according to r: 
case-B*.1.~: 
i~ ~ =(r,~e ~ re~) o~- ~ =(p~.~ ~ ~'~) 
then V := ((r A i) (q 13 j)); 
,s', := &u{v); 
p := ~, u {(v -, u,)}; 
-- and so on as in step.B.l 
step-B.2: -- Exit for main loop 
if i = n then exit loop; ~- go to step-C 
h := i+1; 
while Xh=* do h := h+l; 
step-B.3: -- Initialization of item-set Sh 
&:=¢; 
for every item u = ((p A i) (q B j)) in e do 
for every scanning transition r in ~ do 
-- Proceed by eases as in step.B.1, 
-- but with scanning transitions, and 
-- adding the new items to Sh instead of St. 
--- See for example the following case: 
fase-B.$.2: 
if r=(pea ~-~ rcz) with xh =a or xh=? 
then V := ((r C h) (p A i)) ; 
& := &u{v}; 
~, := ,, u {(v -, z)}; 
~: -- Inerementation of scanning index i 
i := h; 
end loop; 
step-C: -- Termination
  for every item U = ((f $ n) (q $ 0)) in S_n
  such that f ∈ F do
    P := P ∪ {(U_f → U)}; -- U_f is the initial nonterminal of G.
-- End of parse
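
As an illustration of the completion step (step-B.1), the four non-scanning cases can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used for the paper: the names `Transition`, `complete` and `EPS`, the encoding of items as nested tuples, and the two-transition automaton at the end are all hypothetical. The item-by-item iteration of the pseudocode is replaced here by a naive fixpoint loop, which is simpler to state but does more work than necessary.

```python
from typing import NamedTuple, Optional

EPS = None  # stands for ε: no stack symbol, or no input symbol


class Transition(NamedTuple):
    p: str                # source state
    pop: Optional[str]    # stack symbol popped, or EPS
    scan: Optional[str]   # input symbol scanned, or EPS
    r: str                # target state
    push: Optional[str]   # stack symbol pushed, or EPS
    out: str              # output string z


def complete(item_sets, i, transitions, productions):
    """Close item_sets[i] under the non-scanning cases B.1.1 to B.1.4."""
    S_i = item_sets[i]
    changed = True
    while changed:  # naive fixpoint: repeat until no new item appears
        changed = False
        for U in list(S_i):
            (p, A, _), (q, B, j) = U
            for t in transitions:
                if t.p != p or t.scan is not EPS:
                    continue  # wrong source state, or a scanning transition
                new = []  # pairs (item V, right-hand side of its production)
                if t.pop is EPS and t.push is EPS:      # B.1.1 stack-free
                    new.append((((t.r, A, i), (q, B, j)), (U, t.out)))
                elif t.pop is EPS:                      # B.1.2 push
                    new.append((((t.r, t.push, i), (p, A, i)), (t.out,)))
                elif t.pop == A and t.push is EPS:      # B.1.3 pop
                    for Y in list(item_sets[j]):
                        if Y[0] == (q, B, j):
                            new.append((((t.r, B, i), Y[1]), (Y, U, t.out)))
                elif t.pop == A:                        # B.1.4 pop-push
                    new.append((((t.r, t.push, i), (q, B, j)), (U, t.out)))
                for V, rhs in new:
                    productions.add((V, rhs))
                    if V not in S_i:
                        S_i.add(V)
                        changed = True


# Hypothetical automaton: push A (output "a"), then pop it (output "b").
delta = [Transition("q0", EPS, EPS, "q1", "A", "a"),
         Transition("q1", "A", EPS, "q2", EPS, "b")]
U0 = (("q0", "$", 0), ("q0", "$", 0))
item_sets = {0: {U0}}
P = set()
complete(item_sets, 0, delta, P)
```

On this toy input the call leaves three items in `item_sets[0]` (the initial item, the item created by the push, and the item created by the pop) and two productions in `P`. The pop case consults `item_sets[j]`, which may be the very set being completed when j = i; re-running the loop until nothing changes is one simple way to make sure such late additions are still taken into account.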

References 

\[1\] Aho, A.V.; Sethi, R.; and Ullman, J.D. 1986 Com- 
pilers -- Principles, Techniques and Tools. Addison- 
Wesley. 

\[2\] Aho, A.V.; and Ullman, J.D. 1972 The Theory of 
Parsing, Translation and Compiling. Prentice-Hall, 
Englewood Cliffs, New Jersey. 

\[3\] Billot, S. 1986 Analyse Syntaxique Non-Déterministe. 
Rapport DEA, Université d'Orléans la Source, and 
INRIA, France. 

\[4\] Billot, S.; and Lang, B. 1988 The Structure of Shared 
Forests in Ambiguous Parsing. In preparation. 

\[5\] Bouckaert, M.; Pirotte, A.; and Snelling, M. 1975 
Efficient Parsing Algorithms for General Context-Free 
Grammars. Information Sciences 8(1): 1-26. 

\[6\] Cocke, J.; and Schwartz, J.T. 1970 Programming 
Languages and Their Compilers. Courant Institute 
of Mathematical Sciences, New York University, New 
York. 

\[7\] Coppersmith, D.; and Winograd, S. 1982 On the 
Asymptotic Complexity of Matrix Multiplication. 
SIAM Journal on Computing, 11(3): 472-492. 

\[8\] DeRemer, F.L. 1971 Simple LR(k) Grammars. Com- 
munications ACM 14(7): 453-460. 

\[9\] Donzeau-Gouge, V.; Dubois, C.; Facon, P.; and Jean, 
F. 1987 Development of a Programming Environment 
for SETL. ESEC'87, Proc. of the 1st European 
Software Engineering Conference, Strasbourg (France), 
pp. 23-34. 

\[10\] Earley, J. 1970 An Efficient Context-Free Parsing Al- 
gorithm. Communications ACM 13(2): 94-102. 

\[11\] Graham, S.L.; Harrison, M.A.; and Ruzzo, W.L. 
1980 An Improved Context-Free Recognizer. ACM 
Transactions on Programming Languages and 
Systems 2(3): 415-462. 

\[12\] Griffiths, I.; and Petrick, S. 1965 On the Relative 
Efficiencies of Context-Free Grammar Recognizers. 
Communications ACM 8(5): 289-300. 

\[13\] Hays, D.G. 1962 Automatic Language-Data Process- 
ing. In Computer Applications in the Behavioral Sci- 
ences, (H. Borko ed.), Prentice-Hall, pp. 394-423. 

\[14\] Ichbiah, J.D.; and Morse, S.P. 1970 A Technique 
for Generating Almost Optimal Floyd-Evans 
Productions for Precedence Grammars. Communications 
ACM 13(8): 501-508. 

\[15\] Kasami, J. 1965 An Efficient Recognition and 
Syntax Analysis Algorithm for Context-Free Languages. 
Report of Univ. of Hawaii, also AFCRL-65-758, 
Air Force Cambridge Research Laboratory, Bedford 
(Massachusetts), also 1966, University of Illinois 
Coordinated Science Lab. Report, No. R-257. 

\[16\] Kay, M. 1980 Algorithm Schemata and Data Struc- 
tures in Syntactic Processing. Proceedings of the No- 
bel Symposium on Text Processing, Gothenburg. 

\[17\] Knuth, D.E. 1965 On the Translation of Languages 
from Left to Right. Information and Control, 8: 607- 
639. 

\[18\] Lang, B. 1974 Deterministic Techniques for Efficient 
Non-deterministic Parsers. Proc. of the 2nd 
Colloquium on Automata, Languages and Programming, 
J. Loeckx (ed.), Saarbrücken, Springer Lecture Notes 
in Computer Science 14: 255-269. 
Also: Rapport de Recherche 72, IRIA-Laboria, 
Rocquencourt (France). 

\[19\] Lang, B. 1988 Datalog Automata. To appear in Proc. 
of the 3rd Internat. Conf. on Data and Knowledge 
Bases, Jerusalem (Israel). 

\[20\] Lang, B. 1988 Complete Evaluation of Horn Clauses, 
an Automata Theoretic Approach. In preparation. 

\[21\] Li, T.; and Chun, H.W. 1987 A Massively Parallel 
Network-Based Natural Language Parsing System. 
Proc. of 2nd Int. Conf. on Computers and 
Applications, Beijing (Peking): 401-408. 

\[22\] Nakagawa, S. 1987 Spoken Sentence Recognition by 
Time-Synchronous Parsing Algorithm of 
Context-Free Grammar. Proc. ICASSP 87, Dallas (Texas), 
Vol. 2: 829-832. 

\[23\] Phillips, J.D. 1986 A Simple Efficient Parser for 
Phrase-Structure Grammars. Quarterly Newsletter 
of the Soc. for the Study of Artificial Intelligence 
(AISBQ) 59: 14-19. 

\[24\] Pratt, V.R. 1975 LINGOL -- A Progress Report. In 
Proceedings of the 4th IJCAI: 422-428. 

\[25\] Sheil, B.A. 1976 Observations on Context Free 
Parsing. In Statistical Methods in Linguistics: 71-109, 
Stockholm (Sweden), Proc. of Internat. Conf. on 
Computational Linguistics (COLING-76), Ottawa (Canada). 
Also: Technical Report TR 12-76, Center for 
Research in Computing Technology, Aiken Computation 
Laboratory, Harvard Univ., Cambridge 
(Massachusetts). 

\[26\] Shieber, S.M. 1985 Using Restriction to Extend 
Parsing Algorithms for Complex-Feature-Based 
Formalisms. Proceedings of the 23rd Annual Meeting of 
the Association for Computational Linguistics: 145-152. 

\[27\] Tomita, M. 1986 An Efficient Word Lattice 
Parsing Algorithm for Continuous Speech Recognition. In 
Proceedings of IEEE-IECE-ASJ International 
Conference on Acoustics, Speech, and Signal Processing 
(ICASSP 86), Vol. 3: 1569-1572. 

\[28\] Tomita, M. 1987 An Efficient Augmented-Context-Free 
Parsing Algorithm. Computational Linguistics 13(1-2): 31-46. 

\[29\] Valiant, L.G. 1975 General Context-Free Recognition 
in Less than Cubic Time. Journal of Computer and 
System Sciences, 10: 308-315. 

\[30\] Younger, D.H. 1967 Recognition and Parsing of 
Context-Free Languages in Time n^3. Information and 
Control, 10(2): 189-208. 
