Efficient Parsing for Bilexical Context-Free Grammars 
and Head Automaton Grammars* 
Jason Eisner 
Dept. of Computer and Information Science
University of Pennsylvania 
200 South 33rd Street, 
Philadelphia, PA 19104 USA 
jeisner@linc.cis.upenn.edu
Giorgio Satta 
Dip. di Elettronica e Informatica 
Università di Padova
via Gradenigo 6/A, 
35131 Padova, Italy 
satta@dei.unipd.it
Abstract 
Several recent stochastic parsers use bilexical 
grammars, where each word type idiosyncrat- 
ically prefers particular complements with par- 
ticular head words. We present O(n 4) parsing 
algorithms for two bilexical formalisms, improv- 
ing the prior upper bounds of O(n5). For a com- 
mon special case that was known to allow O(n 3) 
parsing (Eisner, 1997), we present an O(n 3) al- 
gorithm with an improved grammar constant. 
1 Introduction 
Lexicalized grammar formalisms are of both 
theoretical and practical interest to the com- 
putational linguistics community. Such for- 
malisms specify syntactic facts about each word 
of the language--in particular, the type of 
arguments that the word can or must take. 
Early mechanisms of this sort included catego- 
rial grammar (Bar-Hillel, 1953) and subcatego- 
rization frames (Chomsky, 1965). Other lexi- 
calized formalisms include (Schabes et al., 1988; Mel'čuk, 1988; Pollard and Sag, 1994).
Besides the possible arguments of a word, a 
natural-language grammar does well to specify 
possible head words for those arguments. "Con- 
vene" requires an NP object, but some NPs are 
more semantically or lexically appropriate here 
than others, and the appropriateness depends 
largely on the NP's head (e.g., "meeting"). We 
use the general term bilexical for a grammar 
that records such facts. A bilexical grammar 
makes many stipulations about the compatibil- 
ity of particular pairs of words in particular 
roles. The acceptability of "Nora convened the 
" The authors were supported respectively under ARPA 
Grant N6600194-C-6043 "Human Language Technology" 
and Ministero dell'Universitk e della Ricerca Scientifica 
e Tecnologica project "Methodologies and Tools of High 
Performance Systems for Multimedia Applications." 
party" then depends on the grammar writer's 
assessment of whether parties can be convened. 
Several recent real-world parsers have im- 
proved state-of-the-art parsing accuracy by re- 
lying on probabilistic or weighted versions of 
bilexical grammars (Alshawi, 1996; Eisner, 
1996; Charniak, 1997; Collins, 1997). The rationale is that soft selectional restrictions play a crucial role in disambiguation.¹
¹Other relevant parsers simultaneously consider two or more words that are not necessarily in a dependency relationship (Lafferty et al., 1992; Magerman, 1995; Collins and Brooks, 1995; Chelba and Jelinek, 1998).
The chart parsing algorithms used by most of the above authors run in time O(n^5), because bilexical grammars are enormous (the part of the grammar relevant to a length-n input has size O(n^2) in practice). Heavy probabilistic pruning is therefore needed to get acceptable runtimes. But in this paper we show that the complexity is not so bad after all:
• For bilexicalized context-free grammars, O(n^4) is possible.
• The O(n^4) result also holds for head automaton grammars.
• For a very common special case of these grammars where an O(n^3) algorithm was previously known (Eisner, 1997), the grammar constant can be reduced without harming the O(n^3) property.
Our algorithmic technique throughout is to pro- 
pose new kinds of subderivations that are not 
constituents. We use dynamic programming to 
assemble such subderivations into a full parse. 
2 Notation for context-free 
grammars 
The reader is assumed to be familiar with context-free grammars. Our notation follows (Harrison, 1978; Hopcroft and Ullman, 1979).
A context-free grammar (CFG) is a tuple G = (V_N, V_T, P, S), where V_N and V_T are finite, disjoint sets of nonterminal and terminal symbols, respectively, and S ∈ V_N is the start symbol. Set P is a finite set of productions having the form A → α, where A ∈ V_N, α ∈ (V_N ∪ V_T)*. If every production in P has the form A → B C or A → a, for A, B, C ∈ V_N, a ∈ V_T, then the grammar is said to be in Chomsky Normal Form (CNF).² Every language that can be generated by a CFG can also be generated by a CFG in CNF.
In this paper we adopt the following conventions: a, b, c, d denote symbols in V_T; w, x, y denote strings in V_T*; and α, β, ... denote strings in (V_N ∪ V_T)*. The input to the parser will be a CFG G together with a string of terminal symbols to be parsed, w = d_1 d_2 ··· d_n. Also h, i, j, k denote positive integers, which are assumed to be ≤ n when we are treating them as indices into w. We write w_{i,j} for the input substring d_i ··· d_j (and put w_{i,j} = ε for i > j).
A "derives" relation, written ⇒, is associated with a CFG as usual. We also use the reflexive and transitive closure of ⇒, written ⇒*, and define L(G) accordingly. We write αβδ ⇒* αγδ for a derivation in which only β is rewritten.
3 Bilexical context-free grammars 
We introduce next a grammar formalism that 
captures lexical dependencies among pairs of 
words in VT. This formalism closely resem- 
bles stochastic grammatical formalisms that are 
used in several existing natural language pro- 
cessing systems (see §1). We will specify a non- 
stochastic version, noting that probabilities or 
other weights may be attached to the rewrite 
rules exactly as in stochastic CFG (Gonzales 
and Thomason, 1978; Wetherell, 1980). (See 
§4 for brief discussion.) 
Suppose G = (V_N, V_T, P, T[$]) is a CFG in CNF.³ We say that G is bilexical iff there exists a set of "delexicalized nonterminals" V_D such that V_N = {A[a] : A ∈ V_D, a ∈ V_T} and every production in P has one of the following forms:
²Production S → ε is also allowed in a CNF grammar if S never appears on the right side of any production. However, S → ε is not allowed in our bilexical CFGs.
³We have a more general definition that drops the restriction to CNF, but do not give it here.
• A[a] → B[b] C[a]  (1)
• A[a] → C[a] B[b]  (2)
• A[a] → a  (3)
Thus every nonterminal is lexicalized at some terminal a. A constituent of nonterminal type A[a] is said to have terminal symbol a as its lexical head, "inherited" from the constituent's head child in the parse tree (e.g., C[a]).
Notice that the start symbol is necessarily a lexicalized nonterminal, T[$]. Hence $ appears in every string of L(G); it is usually convenient to define G so that the language of interest is actually L'(G) = {x : x$ ∈ L(G)}.
Such a grammar can encode lexically specific 
preferences. For example, P might contain the 
productions 
• VP[solve] → V[solve] NP[puzzles]
• NP[puzzles] → DET[two] N[puzzles]
• V[solve] → solve
• N[puzzles] → puzzles
• DET[two] → two
in order to allow the derivation VP[solve] ⇒* solve two puzzles, but meanwhile omit the similar productions
• VP[eat] → V[eat] NP[puzzles]
• VP[solve] → V[solve] NP[goat]
• VP[sleep] → V[sleep] NP[goat]
• NP[goat] → DET[two] N[goat]
since puzzles are not edible, a goat is not solv- 
able, "sleep" is intransitive, and "goat" cannot 
take plural determiners. (A stochastic version 
of the grammar could implement "soft prefer- 
ences" by allowing the rules in the second group 
but assigning them various low probabilities.) 
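To make the encoding concrete, here is a minimal sketch (ours, not from the original paper) of how such a grammar might be stored so that productions can be looked up by their pair of terminal symbols; the dictionary names are illustrative, and the parsers sketched below assume some such indexing.

    # Sketch: the example bilexical grammar above as indexed Python data.
    # A lexicalized nonterminal A[a] is represented by the pair (A, a);
    # all names here are illustrative.

    # Form (1): A[a] -> B[b] C[a] (head child on the right), keyed by (b, a).
    LEFT_DEP = {
        ("two", "puzzles"): [("NP", "DET", "N")],   # NP[puzzles] -> DET[two] N[puzzles]
    }
    # Form (2): A[a] -> C[a] B[b] (head child on the left), keyed by (a, b).
    RIGHT_DEP = {
        ("solve", "puzzles"): [("VP", "V", "NP")],  # VP[solve] -> V[solve] NP[puzzles]
    }
    # Form (3): A[a] -> a.
    LEXICAL = {"solve": ["V"], "puzzles": ["N"], "two": ["DET"]}

Under this indexing, the constant p of §4 below is just the length of the longest list stored under any single key.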
The cost of this expressiveness is a very large 
grammar. Standard context-free parsing algo- 
rithms are inefficient in such a case. The CKY 
algorithm (Younger, 1967; Aho and Ullman, 
1972) is time O(n 3. IPI), where in the worst case 
IPI = \[VNI 3 (one ignores unary productions). 
For a bilexical grammar, the worst case is IPI = 
I VD 13. I VT 12, which is large for a large vocabulary 
VT. We may improve the analysis somewhat by 
observing that when parsing dl ... dn, the CKY 
algorithm only considers nonterminals of the 
form A\[di\]; by restricting to the relevant pro- 
ductions we obtain O(n 3. IVDI 3. min(n, IVTI)2). 
We observe that in practical applications we always have n ≪ |V_T|. Let us then restrict our analysis to the (infinite) set of input instances of the parsing problem that satisfy the relation n < |V_T|. With this assumption, the asymptotic time complexity of the CKY algorithm becomes O(n^5 · |V_D|^3). In other words, it is a factor of n^2 slower than a comparable non-lexicalized CFG.
4 Bilexical CFG in time O(n^4)
In this section we give a recognition algorithm for bilexical CNF context-free grammars, which runs in time O(n^4 · max(p, |V_D|^2)) = O(n^4 · |V_D|^3). Here p is the maximum number of productions sharing the same pair of terminal symbols (e.g., the pair (b, a) in production (1)). The new algorithm is asymptotically more efficient than the CKY algorithm when restricted to input instances satisfying the relation n < |V_T|.
Where CKY recognizes only constituent substrings of the input, the new algorithm can recognize three types of subderivations, shown and described in Figure 1(a). A declarative specification of the algorithm is given in Figure 1(b). The derivability conditions of (a) are guaranteed by (b), by induction, and the correctness of the acceptance condition (see caption) follows.
This declarative specification, like CKY, may 
be implemented by bottom-up dynamic pro- 
gramming. We sketch one such method. For 
each possible item, as shown in (a), we maintain 
a bit (indexed by the parameters of the item) 
that records whether the item has been derived 
yet. All these bits are initially zero. The algo- 
rithm makes a single pass through the possible 
items, setting the bit for each if it can be derived 
using any rule in (b) from items whose bits are 
already set. At the end of this pass it is straight- 
forward to test whether to accept w (see cap- 
tion). The pass considers the items in increasing order of width, where the width of an item in (a) is defined as max{h, i, j} - min{h, i, j}. Among items of the same width, those of the first type (constituents) should be considered last.
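The following runnable sketch (our illustration; all names are hypothetical) implements the deduction system of Figure 1 over the indexed grammar sketched in §3. For brevity it closes the item sets by naive fixpoint iteration rather than by the width-ordered single pass just described; the set of derived items is the same.

    def recognize(words, lexical, left_dep, right_dep, start="T"):
        # Recognize d_1..d_n (a list of terminals, normally ending in "$")
        # with a bilexical CNF grammar indexed as in the earlier sketch.
        n = len(words)
        w = lambda i: words[i - 1]             # 1-indexed access to d_1..d_n
        C = set()  # constituent items (A, i, h, j): A[d_h] =>* w_{i,j}
        L = set()  # left-attachment items (A, X, i, j, h), i <= j < h
        R = set()  # right-attachment items (A, X, h, i, j), h < i <= j

        for h in range(1, n + 1):              # START: A[d_h] -> d_h
            for A in lexical.get(w(h), []):
                C.add((A, h, h, h))

        changed = True
        while changed:                         # naive fixpoint closure
            changed = False
            for (B, i, hp, j) in list(C):
                for h in range(j + 1, n + 1):  # ATTACH-LEFT: A[d_h] -> B[d_h'] X[d_h]
                    for (A, B2, X) in left_dep.get((w(hp), w(h)), []):
                        if B2 == B and (A, X, i, j, h) not in L:
                            L.add((A, X, i, j, h)); changed = True
                for h in range(1, i):          # ATTACH-RIGHT: A[d_h] -> X[d_h] B[d_h']
                    for (A, X, B2) in right_dep.get((w(h), w(hp)), []):
                        if B2 == B and (A, X, h, i, j) not in R:
                            R.add((A, X, h, i, j)); changed = True
            for (A, X, i, j, h) in list(L):    # COMPLETE-LEFT
                for (X2, i2, h2, k) in list(C):
                    if (X2, i2, h2) == (X, j + 1, h) and (A, i, h, k) not in C:
                        C.add((A, i, h, k)); changed = True
            for (A, X, h, i, j) in list(R):    # COMPLETE-RIGHT
                for (X2, k, h2, j2) in list(C):
                    if (X2, h2, j2) == (X, h, i - 1) and (A, k, h, j) not in C:
                        C.add((A, k, h, j)); changed = True

        return any((start, 1, h, n) in C and w(h) == "$" for h in range(1, n + 1))

A call such as recognize("solve two puzzles $".split(), LEXICAL, LEFT_DEP, RIGHT_DEP) would return True once productions rooting the sentence in T[$] are added to the toy grammar of §3.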
The algorithm requires space proportional to the number of possible items, which is at most n^3 |V_D|^2. Each of the five rule templates can instantiate its free variables in at most n^4 p or (for COMPLETE rules) n^4 |V_D|^2 different ways, each of which is tested once and in constant time; so the runtime is O(n^4 · max(p, |V_D|^2)).
By comparison, the CKY algorithm uses only the first type of item, and relies on rules whose inputs are pairs of adjacent constituents, one spanning w_{i,j} with head d_h and the other spanning w_{j+1,k} with head d_{h'}. Such rules can be instantiated in O(n^5) different ways for a fixed grammar, yielding O(n^5) time complexity. The new algorithm saves a factor of n by combining those two constituents in two steps, one of which is insensitive to k and abstracts over its possible values, the other of which is insensitive to h' and abstracts over its possible values.
It is straightforward to turn the new O(n^4) recognition algorithm into a parser for stochastic bilexical CFGs (or other weighted bilexical CFGs). In a stochastic CFG, each nonterminal A[a] is accompanied by a probability distribution over productions of the form A[a] → α. A parse is just a derivation (proof tree) of the accepting item [△, T, 1, h, n], and its probability, like that of any derivation we find, is defined as the product of the probabilities of all productions used to condition inference rules in the proof tree. The highest-probability derivation for any item can be reconstructed recursively at the end of the parse, provided that each item maintains not only a bit indicating whether it can be derived, but also the probability and instantiated root rule of its highest-probability derivation tree.
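Concretely, the bookkeeping might look as follows (a sketch under the assumptions above; relax would be called by START and ATTACH with the conditioning production's probability folded into prob, and by COMPLETE with no production):

    best = {}   # item -> probability of its best derivation found so far
    back = {}   # item -> (rule name, antecedent items, production or None)

    def relax(item, prob, rule, antecedents, production=None):
        # Record a candidate derivation of `item`, keeping only the best.
        if prob > best.get(item, 0.0):
            best[item] = prob
            back[item] = (rule, antecedents, production)

    def extract(item):
        # Reconstruct the highest-probability derivation tree recursively.
        rule, antecedents, production = back[item]
        return (rule, production, [extract(a) for a in antecedents])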
5 A more efficient variant 
We now give a variant of the algorithm of §4; the 
variant has the same asymptotic complexity but 
will often be faster in practice. 
Notice that the ATTACH-LEFT rule of Figure 1(b) tries to combine the nonterminal label B[d_{h'}] of a previously derived constituent with every possible nonterminal label of the form C[d_h]. The improved version, shown in Figure 2, restricts C[d_h] to be the label of a previously derived adjacent constituent. This improves speed if there are not many such constituents and we can enumerate them in O(1) time apiece (using a sparse parse table to store the derived items).
It is necessary to use an agenda data struc- 
ture (Kay, 1986) when implementing the declar- 
ative algorithm of Figure 2. Deriving narrower 
items before wider ones as before will not work 
here because the rule HALVE derives narrow 
items from wide ones. 
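A generic agenda loop of the required kind might be sketched as follows (our illustration; the consequents callback would encapsulate the rules of Figure 2):

    from collections import deque

    def agenda_closure(axioms, consequents):
        # consequents(trigger, chart) must yield every item derivable by one
        # rule application using `trigger` together with items already in
        # `chart`. No ordering by width is assumed, so rules like HALVE,
        # which derive narrow items from wide ones, are handled correctly.
        chart = set(axioms)
        agenda = deque(axioms)
        while agenda:
            trigger = agenda.popleft()
            for item in consequents(trigger, chart):
                if item not in chart:
                    chart.add(item)      # new consequent: record it
                    agenda.append(item)  # and schedule it as a future trigger
        return chart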
(a) Types of items (item diagrams not reproduced):
• constituent item [△, A, i, h, j] (i ≤ h ≤ j, A ∈ V_D): derived iff A[d_h] ⇒* w_{i,j}
• left-attachment item (i ≤ j < h, A, C ∈ V_D): derived iff A[d_h] ⇒ B[d_{h'}] C[d_h] ⇒* w_{i,j} C[d_h] for some B, h'
• right-attachment item (h < i ≤ j, A, C ∈ V_D): derived iff A[d_h] ⇒ C[d_h] B[d_{h'}] ⇒* C[d_h] w_{i,j} for some B, h'
(b) Inference rules (rule diagrams not reproduced): START, with side condition A[d_h] → d_h; ATTACH-LEFT, with side condition A[d_h] → B[d_{h'}] C[d_h]; ATTACH-RIGHT, with side condition A[d_h] → C[d_h] B[d_{h'}]; COMPLETE-LEFT; COMPLETE-RIGHT.
Figure 1: An O(n^4) recognition algorithm for CNF bilexical CFG. (a) Types of items in the parse table (chart). The first is syntactic sugar for the tuple [△, A, i, h, j], and so on. The stated conditions assume that d_1, ..., d_n are all distinct. (b) Inference rules. The algorithm derives the item below the line if the items above the line have already been derived and any condition to the right of the line is met. It accepts input w just if item [△, T, 1, h, n] is derived for some h such that d_h = $.
(a) Types of items (item diagrams not reproduced):
• constituent item (i ≤ h ≤ j, A ∈ V_D): derived iff A[d_h] ⇒* w_{i,j}
• half-constituent item recording only i and h (i ≤ h, A ∈ V_D): derived iff A[d_h] ⇒* w_{i,j} for some j ≥ h
• half-constituent item recording only h and j (h ≤ j, A ∈ V_D): derived iff A[d_h] ⇒* w_{i,j} for some i ≤ h
• left-attachment item (i ≤ j < h, A, C ∈ V_D): derived iff A[d_h] ⇒ B[d_{h'}] C[d_h] ⇒* w_{i,j} C[d_h] ⇒* w_{i,k} for some B, h', k
• right-attachment item (h < i ≤ j, A, C ∈ V_D): derived iff A[d_h] ⇒ C[d_h] B[d_{h'}] ⇒* C[d_h] w_{i,j} ⇒* w_{k,j} for some B, h', k
(b) As in Figure 1(b) above, but add HALVE and change ATTACH-LEFT and ATTACH-RIGHT as shown: HALVE; ATTACH-LEFT, with side condition A[d_h] → B[d_{h'}] C[d_h]; ATTACH-RIGHT, with side condition A[d_h] → C[d_h] B[d_{h'}]. (Rule diagrams not reproduced.)
Figure 2: A more efficient variant of the O(n^4) algorithm in Figure 1, in the same format.
6 Multiple word senses 
Rather than parsing an input string directly, it is often desirable to parse another string related by a (possibly stochastic) transduction. Let T be a finite-state transducer that maps a morpheme sequence w ∈ V_T* to its orthographic realization, a grapheme sequence w̄. T may realize arbitrary morphological processes, including affixation, local clitic movement, deletion of phonological nulls, forbidden or dispreferred k-grams, typographical errors, and mapping of multiple senses onto the same grapheme. Given grammar G and an input w̄, we ask whether w̄ ∈ T(L(G)). We have extended all the algorithms in this paper to this case: the items simply keep track of the transducer state as well.
Due to space constraints, we sketch only the special case of multiple senses. Suppose that the input is w̄ = d̄_1 ··· d̄_n, and each d̄_i has up to g possible senses. Each item now needs to track its head's sense along with its head's position in w̄. Wherever an item formerly recorded a head position h (similarly h'), it must now record a pair (h, d_h), where d_h ∈ V_T is a specific sense of d̄_h. No rule in Figures 1-2 (or Figure 3 below) will mention more than two such pairs. So the time complexity increases by a factor of O(g^2).
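For instance, in the implementation sketch of §4, only the axioms and the head indices change; a hedged sketch of the sense-annotated START rule:

    def start_items(word_senses, lexical):
        # word_senses[h-1] is the collection of up to g senses (elements
        # of V_T) of the h-th grapheme of the input. Heads are recorded
        # as pairs (h, sense) rather than bare positions h.
        items = set()
        for h, senses in enumerate(word_senses, start=1):
            for s in senses:
                for A in lexical.get(s, []):
                    items.add((A, h, (h, s), h))
        return items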
7 Head automaton grammars in time O(n^4)
In this section we show that a length-n string generated by a head automaton grammar (Alshawi, 1996) can be parsed in time O(n^4). We do this by providing a translation from head automaton grammars to bilexical CFGs.⁴ This result improves on the head-automaton parsing algorithm given by Alshawi, which is analogous to the CKY algorithm on bilexical CFGs and is likewise O(n^5) in practice (see §3).
A head automaton grammar (HAG) is a function H : a ↦ H_a that defines a head automaton (HA) for each element of its (finite) domain. Let V_T = domain(H) and D = {→, ←}. A special symbol $ ∈ V_T plays the role of start symbol. For each a ∈ V_T, H_a is a tuple (Q_a, V_T, δ_a, I_a, F_a), where
• Q_a is a finite set of states;
• I_a, F_a ⊆ Q_a are sets of initial and final states, respectively;
• δ_a is a transition function mapping Q_a × V_T × D to 2^{Q_a}, the power set of Q_a.
⁴Translation in the other direction is possible if the HAG formalism is extended to allow multiple senses per word (see §6). This makes the formalisms equivalent.
A single head automaton is an acceptor for a language of string pairs ⟨z_l, z_r⟩ ∈ V_T* × V_T*. Informally, if b is the leftmost symbol of z_r and q' ∈ δ_a(q, b, →), then H_a can move from state q to state q', matching symbol b and removing it from the left end of z_r. Symmetrically, if b is the rightmost symbol of z_l and q' ∈ δ_a(q, b, ←), then from q H_a can move to q', matching symbol b and removing it from the right end of z_l.⁵
More formally, we associate with the head automaton H_a a "derives" relation ⊢_a, defined as a binary relation on Q_a × V_T* × V_T*. For every q ∈ Q_a, x, y ∈ V_T*, b ∈ V_T, d ∈ D, and q' ∈ δ_a(q, b, d), we specify that
(q, xb, y) ⊢_a (q', x, y) if d = ←;
(q, x, by) ⊢_a (q', x, y) if d = →.
The reflexive and transitive closure of ⊢_a is written ⊢_a*. The language generated by H_a is the set
L(H_a) = {⟨z_l, z_r⟩ | (q, z_l, z_r) ⊢_a* (r, ε, ε), q ∈ I_a, r ∈ F_a}.
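The definitions above translate directly into code; here is a runnable sketch (our illustration, with "->" and "<-" standing for the two directions in D):

    class HeadAutomaton:
        # H_a = (Q_a, V_T, delta_a, I_a, F_a); delta maps (state, symbol,
        # direction) to a collection of successor states.
        def __init__(self, delta, initial, final):
            self.delta, self.initial, self.final = delta, initial, final

        def accepts(self, zl, zr):
            # True iff <z_l, z_r> is in L(H_a). Reading is inside out:
            # a "->" move consumes the leftmost remaining symbol of z_r,
            # a "<-" move the rightmost remaining symbol of z_l.
            agenda = [(q, tuple(zl), tuple(zr)) for q in self.initial]
            seen = set(agenda)
            while agenda:
                q, l, r = agenda.pop()
                if not l and not r and q in self.final:
                    return True
                succ = []
                if r:
                    succ += [(s, l, r[1:]) for s in self.delta.get((q, r[0], "->"), ())]
                if l:
                    succ += [(s, l[:-1], r) for s in self.delta.get((q, l[-1], "<-"), ())]
                for c in succ:
                    if c not in seen:
                        seen.add(c)
                        agenda.append(c)
            return False

Because every transition consumes one symbol of z_l or z_r, the configuration space is finite and this exhaustive search always terminates.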
We may now define the language generated by the entire grammar H. To generate, we expand the start word $ ∈ V_T into x$y for some (x, y) ∈ L(H_$), and then recursively expand the words in strings x and y. More formally, given H, we simultaneously define L_a for all a ∈ V_T to be minimal such that if (x, y) ∈ L(H_a), x' ∈ L_x, y' ∈ L_y, then x'ay' ∈ L_a, where L_{a_1···a_k} stands for the concatenation language L_{a_1} ··· L_{a_k}. Then H generates language L_$.
We next present a simple construction that transforms a HAG H into a bilexical CFG G generating the same language. The construction also preserves derivation ambiguity. This means that for each string w, there is a linear-time 1-to-1 mapping between (appropriately defined) canonical derivations of w by H and canonical derivations of w by G.
⁵Alshawi (1996) describes HAs as accepting (or equivalently, generating) z_l and z_r from the outside in. To make Figure 3 easier to follow, we have defined HAs as accepting symbols in the opposite order, from the inside out. This amounts to the same thing if transitions are reversed, I_a is exchanged with F_a, and any transition probabilities are replaced by those of the reversed Markov chain.
We adopt the notation above for H and the components of its head automata. Let V_D be an arbitrary set of size t = max{|Q_a| : a ∈ V_T}, and for each a, define an arbitrary injection f_a : Q_a → V_D. We define G = (V_N, V_T, P, T[$]), where
(i) V_N = {A[a] : A ∈ V_D, a ∈ V_T}, in the usual manner for bilexical CFG;
(ii) P is the set of all productions having one of the following forms, where a, b ∈ V_T:
• A[a] → B[b] C[a], where A = f_a(r), B = f_b(q'), C = f_a(q) for some q' ∈ I_b, q ∈ Q_a, r ∈ δ_a(q, b, ←)
• A[a] → C[a] B[b], where A = f_a(r), B = f_b(q'), C = f_a(q) for some q' ∈ I_b, q ∈ Q_a, r ∈ δ_a(q, b, →)
• A[a] → a, where A = f_a(q) for some q ∈ F_a
(iii) T = f_$(q), where we assume WLOG that I_$ is a singleton set {q}.
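A sketch of this construction in code (assuming the HeadAutomaton class sketched above; for simplicity we take V_D = V_T × Q and f_a(q) = (a, q), a valid injection although larger than the minimal size t):

    def hag_to_bilexical(hags):
        # hags maps each word a in V_T (including "$") to its HeadAutomaton.
        f = lambda a, q: (a, q)                       # plays the role of f_a
        productions = []
        for a, ha in hags.items():
            for (q, b, d), targets in ha.delta.items():
                for r in targets:
                    for q2 in hags[b].initial:        # q' ranges over I_b
                        A, B, C = f(a, r), f(b, q2), f(a, q)
                        if d == "<-":                 # A[a] -> B[b] C[a]
                            productions.append(((A, a), [(B, b), (C, a)]))
                        else:                         # A[a] -> C[a] B[b]
                            productions.append(((A, a), [(C, a), (B, b)]))
            for q in ha.final:                        # A[a] -> a
                productions.append(((f(a, q), a), [a]))
        (q0,) = tuple(hags["$"].initial)              # I_$ assumed a singleton
        return productions, (f("$", q0), "$")         # productions and T[$]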
We omit the formal proof that G and H admit isomorphic derivations and hence generate the same languages, observing only that if (x, y) = (b_1 b_2 ··· b_j, b_{j+1} ··· b_k) ∈ L(H_a) (a condition used in defining L_a above), then A[a] ⇒* B_1[b_1] ··· B_j[b_j] a B_{j+1}[b_{j+1}] ··· B_k[b_k], for any A, B_1, ..., B_k that map to initial states in H_a, H_{b_1}, ..., H_{b_k} respectively.
In general, G has p = O(|V_D|^3) = O(t^3). The construction therefore implies that we can parse a length-n sentence under H in time O(n^4 t^3). If the HAs in H happen to be deterministic, then in each binary production given by (ii) above, symbol A is fully determined by a, b, and C. In this case p = O(t^2), so the parser will operate in time O(n^4 t^2).
We note that this construction can be 
straightforwardly extended to convert stochas- 
tic HAGs as in (Alshawi, 1996) into stochastic 
CFGs. Probabilities that H_a assigns to state q's various transition and halt actions are copied onto the corresponding productions A[a] → α of G, where A = f_a(q).
8 Split head automaton grammars in time O(n^3)
For many bilexical CFGs or HAGs of practical significance, just as for the bilexical version of link grammars (Lafferty et al., 1992), it is possible to parse length-n inputs even faster, in time O(n^3) (Eisner, 1997). In this section we describe and discuss this special case, and give a new O(n^3) algorithm that has a smaller grammar constant than previously reported.
A head automaton H_a is called split if it has no states that can be entered on a ← transition and exited on a → transition. Such an automaton can accept (x, y) only by reading all of y (immediately after which it is said to be in a flip state) and then reading all of x. Formally, a flip state is one that allows entry on a → transition and that either allows exit on a ← transition or is a final state.
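Both properties are easy to check mechanically; a sketch (assuming the delta encoding used in the earlier sketches):

    def flip_states(ha):
        # States entered on a "->" transition that allow exit on "<-"
        # or are final.
        entered_right = {s for (q, b, d), ts in ha.delta.items() if d == "->" for s in ts}
        exits_left = {q for (q, b, d) in ha.delta if d == "<-"}
        return {q for q in entered_right if q in exits_left or q in ha.final}

    def is_split(ha):
        # Split: no state is entered on "<-" and exited on "->".
        entered_left = {s for (q, b, d), ts in ha.delta.items() if d == "<-" for s in ts}
        exits_right = {q for (q, b, d) in ha.delta if d == "->"}
        return not (entered_left & exits_right)

An automaton is then g-split (in the sense defined below) iff is_split(ha) holds and len(flip_states(ha)) <= g.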
We are concerned here with head automaton grammars H such that every H_a is split. These correspond to bilexical CFGs in which any derivation A[a] ⇒* xay has the form A[a] ⇒* xB[a] ⇒* xay. That is, a word's left dependents are more oblique than its right dependents and c-command them.
Such grammars are broadly applicable. Even if H_a is not split, there usually exists a split head automaton H'_a recognizing the same language. H'_a exists iff {x#y : ⟨x, y⟩ ∈ L(H_a)} is regular (where # ∉ V_T). In particular, H'_a must exist unless H_a has a cycle that includes both ← and → transitions. Such cycles would be necessary for H_a itself to accept a formal language such as {(b^n, c^n) : n ≥ 0}, where word a takes 2n dependents, but we know of no natural-language motivation for ever using them in a HAG.
One more definition will help us bound the complexity. A split head automaton H_a is said to be g-split if its set of flip states, denoted Q̃_a ⊆ Q_a, has size ≤ g. The languages that can be recognized by g-split HAs are those that can be written as ∪_{i=1}^g L_i × R_i, where the L_i and R_i are regular languages over V_T. Eisner (1997) actually defined (g-split) bilexical grammars in terms of the latter property.⁶
⁶That paper associated a product language L_i × R_i, or equivalently a 1-split HA, with each of g senses of a word (see §6). One could do the same without penalty in our present approach: confining to 1-split automata would remove the g^2 complexity factor, and then allowing g senses would restore the g^2 factor. Indeed, this approach gives added flexibility: a word's sense, unlike its choice of flip state, is visible to the HA that reads it.
We now present our result: Figure 3 specifies an O(n^3 g^2 t^2) recognition algorithm for a head automaton grammar H in which every H_a is g-split. For deterministic automata, the runtime is O(n^3 g^2 t), a considerable improvement on the O(n^3 g^3 t^2) result of (Eisner, 1997), which also assumes deterministic automata. As in §4, a simple bottom-up implementation will suffice. For a practical speedup, add a half-constituent item spanning h to j as an antecedent to the MID rule (and fill in the parse table from right to left).
Like our previous algorithms, this one takes two steps (ATTACH, COMPLETE) to attach a child constituent to a parent constituent. But instead of full constituents (strings x d_i y ∈ L_{d_i}), it uses only half-constituents like x d_i and d_i y. Where CKY combines two adjacent full constituents, spanning w_{i,j} with head d_h and w_{j+1,k} with head d_{h'}, we save two degrees of freedom i, k (so improving O(n^5) to O(n^3)) and combine only the inner halves, spanning w_{h,j} and w_{j+1,h'}. The other halves of these constituents can be attached later, because to find an accepting path for (z_l, z_r) in a split head automaton, one can separately find the half-path before the flip state (which accepts z_r) and the half-path after the flip state (which accepts z_l). These two half-paths can subsequently be joined into an accepting path if they have the same flip state s, i.e., one path starts where the other ends. Annotating our left half-constituents with s makes this check possible.
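The following loose sketch (ours; it abstracts away the actual rules of Figure 3) shows just this final joining step, where the flip-state annotation makes the check a constant-time equality test:

    def join_halves(right_halves, left_halves, flip):
        # right_halves: triples (h, j, q) meaning the automaton for d_h can
        #   reach state q after accepting its right dependents within w_{h+1,j}.
        # left_halves: triples (s, i, h) meaning that, started in state s, the
        #   automaton for d_h accepts its left dependents within w_{i,h-1}.
        # flip[(h, q)] is true iff q is a flip state of the automaton for d_h.
        full = set()
        for (h, j, q) in right_halves:
            for (s, i, h2) in left_halves:
                if h2 == h and s == q and flip.get((h, q), False):
                    full.add((i, h, j))   # d_h heads all of w_{i,j}
        return full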
9 Final remarks 
We have formally described, and given faster parsing algorithms for, three practical grammatical rewriting systems that capture dependencies between pairs of words. All three systems admit naive O(n^5) algorithms. We give the first O(n^4) results for the natural formalism of bilexical context-free grammar, and for Alshawi's (1996) head automaton grammars. For the usual case, split head automaton grammars or equivalent bilexical CFGs, we replace the O(n^3) algorithm of (Eisner, 1997) by one with a smaller grammar constant. Note that, e.g., all three models in (Collins, 1997) are susceptible to the O(n^3) method (cf. Collins's O(n^5)).
Our dynamic programming techniques for 
cheaply attaching head information to deriva- 
tions can also be exploited in parsing formalisms 
other than rewriting systems. The authors have 
developed an O(n^7)-time parsing algorithm for bilexicalized tree adjoining grammars (Schabes, 1992), improving the naive O(n^8) method.
The results mentioned in §6 are related to the 
closure property of CFGs under generalized se- 
quential machine mapping (Hopcroft and Ull- 
man, 1979). This property also holds for our 
class of bilexical CFGs. 

References 
A. V. Aho and J. D. Ullman. 1972. The Theory 
of Parsing, Translation and Compiling, volume 1. 
Prentice-Hall, Englewood Cliffs, NJ. 
H. Alshawi. 1996. Head automata and bilingual 
tiling: Translation with minimal representations. 
In Proc. of ACL, pages 167-176, Santa Cruz, CA. 
Y. Bar-Hillel. 1953. A quasi-arithmetical notation 
for syntactic description. Language, 29:47-58. 
E. Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proc. of the 14th AAAI, Menlo Park.
C. Chelba and F. Jelinek. 1998. Exploiting syntac- 
tic structure for language modeling. In Proc. of 
COLING-ACL. 
N. Chomsky. 1965. Aspects of the Theory of Syntax.
MIT Press, Cambridge, MA. 
M. Collins and J. Brooks. 1995. Prepositional 
phrase attachment through a backed-off model. 
In Proc. of the Third Workshop on Very Large
Corpora, Cambridge, MA. 
M. Collins. 1997. Three generative, lexicalised mod- 
els for statistical parsing. In Proc. of the 35th 
ACL and 8th European ACL, Madrid, July.
J. Eisner. 1996. An empirical comparison of proba- 
bility models for dependency grammar. Technical 
Report IRCS-96-11, IRCS, Univ. of Pennsylvania. 
J. Eisner. 1997. Bilexical grammars and a cubic- 
time probabilistic parser. In Proceedings of the 
5th Int. Workshop on Parsing Technologies, MIT,
Cambridge, MA, September. 
R. C. Gonzales and M. G. Thomason. 1978. Syntac- 
tic Pattern Recognition. Addison-Wesley, Read- 
ing, MA. 
M. A. Harrison. 1978. Introduction to Formal Lan- 
guage Theory. Addison-Wesley, Reading, MA. 
J. E. Hopcroft and J. D. Ullman. 1979. Introduc- 
tion to Automata Theory, Languages and Com- 
putation. Addison-Wesley, Reading, MA. 
M. Kay. 1986. Algorithm schemata and data struc- 
tures in syntactic processing. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Readings in Natural Language Processing, pages 35-70. Kaufmann, Los Altos, CA.
J. Lafferty, D. Sleator, and D. Temperley. 1992. 
Grammatical trigrams: A probabilistic model of 
link grammar. In Proc. of the AAAI Conf. on 
Probabilistic Approaches to Nat. Lang., October. 
D. Magerman. 1995. Statistical decision-tree mod- 
els for parsing. In Proceedings of the 33rd ACL.
I. Mel'čuk. 1988. Dependency Syntax: Theory and
Practice. State University of New York Press. 
C. Pollard and I. Sag. 1994. Head-Driven Phrase 
Structure Grammar. University of Chicago Press. 
Y. Schabes, A. Abeillé, and A. Joshi. 1988. Parsing
strategies with 'lexicalized' grammars: Applica- 
tion to Tree Adjoining Grammars. In Proceedings 
of COLING-88, Budapest, August. 
Yves Schabes. 1992. Stochastic lexicalized tree- 
adjoining grammars. In Proc. of the 14th COLING, pages 426-432, Nantes, France, August.
C. S. Wetherell. 1980. Probabilistic languages: A 
review and some open questions. Computing Sur- 
veys, 12(4):361-379. 
D. H. Younger. 1967. Recognition and parsing of 
context-free languages in time n^3. Information
and Control, 10(2):189-208, February. 
