Dynamic programming for parsing and estimation of
stochastic unification-based grammars
Stuart Geman
Division of Applied Mathematics
Brown University
geman@dam.brown.edu
Mark Johnson
Cognitive and Linguistic Sciences
Brown University
Mark_Johnson@Brown.edu
Abstract
Stochastic unification-based grammars
(SUBGs) define exponential distributions
over the parses generated by a unification-based
grammar (UBG). Existing algorithms
for parsing and estimation require
the enumeration of all of the parses of a
string in order to determine the most likely
one, or in order to calculate the statis-
tics needed to estimate a grammar from
a training corpus. This paper describes a
graph-based dynamic programming algo-
rithm for calculating these statistics from
the packed UBG parse representations of
Maxwell and Kaplan (1995) which does
not require enumerating all parses. Like
many graphical algorithms, the dynamic
programming algorithm’s complexity is
worst-case exponential, but is often poly-
nomial. The key observation is that by
using Maxwell and Kaplan packed repre-
sentations, the required statistics can be
rewritten as either the max or the sum of
a product of functions. This is exactly
the kind of problem which can be solved
by dynamic programming over graphical
models.
 We would like to thank Eugene Charniak, Miyao
Yusuke, Mark Steedman as well as Stefan Riezler and the team
at PARC; naturally all errors remain our own. This research was
supported by NSF awards DMS 0074276 and ITR IIS 0085940.
1 Introduction
Stochastic Unification-Based Grammars (SUBGs)
use log-linear models (also known as exponential or
MaxEnt models and Markov Random Fields) to define
probability distributions over the parses of a unification
grammar. These grammars can incorporate
virtually all kinds of linguistically important con-
straints (including non-local and non-context-free
constraints), and are equipped with a statistically
sound framework for estimation and learning.
Abney (1997) pointed out that the non-context-free dependencies of a
unification grammar require stochastic models more general than
Probabilistic Context-Free Grammars (PCFGs) and Markov Branching
Processes, and proposed the use of log-linear models for defining
probability distributions over the parses of a unification grammar.
Unfortunately, the maximum likelihood estimator Abney proposed for SUBGs
seems computationally intractable since it requires statistics that
depend on the set of all parses of all strings generated by the grammar.
This set is infinite (so exhaustive enumeration is impossible) and
presumably has a very complex structure (so sampling estimates might
take an extremely long time to converge).
Johnson et al. (1999) observed that parsing and related tasks only
require conditional distributions over parses given strings, and that
such conditional distributions are considerably easier to estimate than
joint distributions of strings and their parses. The conditional maximum
likelihood estimator proposed by Johnson et al. requires statistics that
depend on the set of all parses of the strings in the training corpus.
For most linguistically realistic grammars this set is finite, and for
moderate sized grammars and training corpora this estimation procedure
is quite feasible.
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 279-286.
However, our recent experiments involve training
from the Wall Street Journal Penn Tree-bank, and
repeatedly enumerating the parses of its 50,000 sen-
tences is quite time-consuming. Matters are only
made worse because we have moved some of the
constraints in the grammar from the unification component
to the stochastic component. This broadens
the coverage of the grammar, but at the expense of
massively expanding the number of possible parses
of each sentence.
In the mid-1990s unification-based parsers were
developed that do not enumerate all parses of a string
but instead manipulate and return a “packed” representation
of the set of parses. This paper describes
how to find the most probable parse and
the statistics required for estimating a SUBG from
the packed parse set representations proposed by
Maxwell III and Kaplan (1995). This makes it pos-
sible to avoid explicitly enumerating the parses of
the strings in the training corpus.
The methods proposed here are analogues of
the well-known dynamic programming algorithms
for Probabilistic Context-Free Grammars (PCFGs);
specifically the Viterbi algorithm for finding the
most probable parse of a string, and the Inside-Outside
algorithm for estimating a PCFG from unparsed
training data.[1] In fact, because Maxwell and
Kaplan packed representations are just Truth Main-
tenance System (TMS) representations (Forbus and
de Kleer, 1993), the statistical techniques described
here should extend to non-linguistic applications of
TMSs as well.
Dynamic programming techniques have been applied to log-linear models
before. Lafferty et al. (2001) mention that dynamic programming can be
used to compute the statistics required for conditional estimation of
log-linear models based on context-free grammars where the properties
can include arbitrary functions of the input string. Miyao and Tsujii
(2002) (which appeared after this paper was accepted) is the closest
related work we know of. They describe a technique for calculating the
statistics required to estimate a log-linear parsing model with
non-local properties from packed feature forests.
[1] However, because we use conditional estimation, also known as
discriminative training, we require at least some discriminating
information about the correct parse of a string in order to estimate a
stochastic unification grammar.
The rest of this paper is structured as follows. The next section
describes unification grammars and Maxwell and Kaplan packed
representations. The following section reviews stochastic unification
grammars (Abney, 1997) and the statistical quantities required for
efficiently estimating such grammars from parsed training data (Johnson
et al., 1999). The final substantive section of this paper shows how
these quantities can be defined directly in terms of the Maxwell and
Kaplan packed representations.
The notation used in this paper is as follows. Variables are written in
upper case italic, e.g., $X, Y$, etc., the sets they range over are
written in script, e.g., $\mathcal{X}, \mathcal{Y}$, etc., while specific
values are written in lower case italic, e.g., $x, y$, etc. In the case
of vector-valued entities, subscripts indicate particular components.
2 Maxwell and Kaplan packed
representations
This section characterises the properties of unification grammars and
the Maxwell and Kaplan packed parse representations that will be
important for what follows. This characterisation omits many details
about unification grammars and the algorithm by which the packed
representations are actually constructed; see Maxwell III and Kaplan
(1995) for details.
A parse generated by a unification grammar is a finite subset of a set
$\mathcal{F}$ of features. Features are parse fragments, e.g., chart
edges or arcs from attribute-value structures, out of which the packed
representations are constructed. For this paper it does not matter
exactly what features are, but they are intended to be the atomic
entities manipulated by a dynamic programming parsing algorithm. A
grammar defines a set $\Omega$ of well-formed or grammatical parses.
Each parse $\omega \in \Omega$ is associated with a string of words
$Y(\omega)$ called its yield. Note that except for trivial grammars
$\mathcal{F}$ and $\Omega$ are infinite.
If $y$ is a string, then let
$\Omega(y) = \{\omega \in \Omega \mid Y(\omega) = y\}$ and
$\mathcal{F}(y) = \bigcup_{\omega \in \Omega(y)} \{f \in \omega\}$.
That is, $\Omega(y)$ is the set of parses of a string $y$ and
$\mathcal{F}(y)$ is the set of features appearing in the parses of $y$.
In the grammars of interest here $\Omega(y)$ and hence also
$\mathcal{F}(y)$ are finite.
Maxwell and Kaplan’s packed representations of-
ten provide a more compact representation of the
set of parses of a sentence than would be obtained
by merely listing each parse separately. The intuition
behind these packed representations is that for
most strings $y$, many of the features in $\mathcal{F}(y)$ occur
in many of the parses $\Omega(y)$. This is often the case
in natural language, since the same substructure can
appear as a component of many different parses.
Packed feature representations are defined in terms of conditions on the
values assigned to a vector of variables $X$. These variables have no
direct linguistic interpretation; rather, each different assignment of
values to these variables identifies a set of features which constitutes
one of the parses in the packed representation. A condition $a$ on $X$
is a function from $\mathcal{X}$ to $\{0, 1\}$. While for uniformity we
write conditions as functions on the entire vector $X$, in practice
Maxwell and Kaplan’s approach produces conditions whose value depends
only on a few of the variables in $X$, and the efficiency of the
algorithms described here depends on this.
A packed representation of a finite set of parses is
a quadruple $R = (\mathcal{F}', X, N, \alpha)$, where:
- $\mathcal{F}' \supseteq \mathcal{F}(y)$ is a finite set of features,
- $X$ is a finite vector of variables, where each
  variable $X_\ell$ ranges over the finite set $\mathcal{X}_\ell$,
- $N$ is a finite set of conditions on $X$ called the
  no-goods,[2] and
- $\alpha$ is a function that maps each feature $f \in \mathcal{F}'$
  to a condition $\alpha_f$ on $X$.
A vector of values $x$ satisfies the no-goods $N$ iff $N(x) = 1$, where
$N(x) = \prod_{\eta \in N} \eta(x)$. Each $x$ that satisfies the no-goods
identifies a parse
$\omega(x) = \{f \in \mathcal{F}' \mid \alpha_f(x) = 1\}$, i.e.,
$\omega$ is the set of features whose conditions are satisfied by $x$.
We require that each parse be identified by a unique value satisfying
the no-goods. That is, we require that:
$$\forall x, x' \in \mathcal{X}: \text{ if } N(x) = N(x') = 1 \text{ and } \omega(x) = \omega(x') \text{ then } x = x' \qquad (1)$$
Finally, a packed representation $R$ represents the set of parses
$\Omega(R)$ that are identified by values that satisfy the no-goods,
i.e., $\Omega(R) = \{\omega(x) \mid x \in \mathcal{X}, N(x) = 1\}$.
[2] The name “no-good” comes from the TMS literature, and was used by
Maxwell and Kaplan. However, here the no-goods actually identify the
good variable assignments.
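The definitions in this section can be made concrete with a small sketch. The representation below is a hypothetical toy (two binary variables, four invented feature names, one no-good) and is not taken from Maxwell and Kaplan's parser. It enumerates assignments only to illustrate what a packed representation denotes, which is exactly the enumeration the later algorithms aim to avoid.

```python
from itertools import product

# Hypothetical toy representation: two binary variables, four features.
domains = [(0, 1), (0, 1)]                      # ranges of X1 and X2
no_goods = [lambda x: int(not (x[0] == 1 and x[1] == 1))]
alpha = {                                       # feature -> condition on x
    "f_shared": lambda x: 1,
    "f_a": lambda x: int(x[0] == 1),
    "f_b": lambda x: int(x[0] == 0),
    "f_c": lambda x: int(x[1] == 1),
}


def satisfies_no_goods(x):
    """N(x): product of all no-good conditions, 1 iff x is admissible."""
    result = 1
    for eta in no_goods:
        result *= eta(x)
    return result


def parse_of(x):
    """omega(x): the set of features whose conditions hold at x."""
    return frozenset(f for f, cond in alpha.items() if cond(x) == 1)


def represented_parses():
    """Omega(R): parses identified by assignments satisfying the no-goods."""
    return [parse_of(x) for x in product(*domains)
            if satisfies_no_goods(x) == 1]


parses = represented_parses()
# Requirement (1): distinct admissible assignments give distinct parses.
unique = len(parses) == len(set(parses))
```

In this toy case three of the four assignments are admissible, and the feature `f_shared` is stored once although it belongs to every parse.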
Maxwell III and Kaplan (1995) describe a parsing algorithm for
unification-based grammars that takes as input a string $y$ and returns
a packed representation $R$ such that $\Omega(R) = \Omega(y)$, i.e., $R$
represents the set of parses of the string $y$. The SUBG parsing and
estimation algorithms described in this paper use Maxwell and Kaplan’s
parsing algorithm as a subroutine.
3 Stochastic Unification-Based Grammars
This section reviews the probabilistic framework
used in SUBGs, and describes the statistics that
must be calculated in order to estimate the pa-
rameters of a SUBG from parsed training data.
For a more detailed exposition and descriptions
of regularization and other important details, see
Johnson et al. (1999).
The probability distribution over parses is defined in terms of a finite
vector $g = (g_1, \ldots, g_m)$ of properties. A property is a
real-valued function of parses $\Omega$. Johnson et al. (1999) placed no
restrictions on what functions could be properties, permitting
properties to encode arbitrary global information about a parse.
However, the dynamic programming algorithms presented here require the
information encoded in properties to be local with respect to the
features $\mathcal{F}$ used in the packed parse representation.
Specifically, we require that properties be defined on features rather
than parses, i.e., each feature $f \in \mathcal{F}$ is associated with a
finite vector of real values $(g_1(f), \ldots, g_m(f))$ which define the
property functions for parses as follows:
$$g_k(\omega) = \sum_{f \in \omega} g_k(f), \quad \text{for } k = 1, \ldots, m. \qquad (2)$$
That is, the property values of a parse are the sum
of the property values of its features. In the usual
case, some features will be associated with a single
property (i.e., $g_k(f)$ is equal to 1 for a specific value
of $k$ and 0 otherwise), and other features will be associated
with no properties at all (i.e., $g(f) = 0$).
This requires properties to be very local with respect to features,
which means that we give up the ability to define properties
arbitrarily. Note however that we can still encode essentially arbitrary
linguistic information in properties by adding specialised features to
the underlying unification grammar. For example, suppose we want a
property that indicates whether the parse contains a reduced relative
clause headed by a past participle (such “garden path” constructions are
grammatical but often almost incomprehensible, and alternative parses
not including such constructions would probably be preferred). Under the
current definition of properties, we can introduce such a property by
modifying the underlying unification grammar to produce a certain
“diacritic” feature in a parse just in case the parse actually contains
the appropriate reduced relative construction. Thus, while properties
are required to be local relative to features, we can use the ability of
the underlying unification grammar to encode essentially arbitrary
non-local information in features to introduce properties that also
encode non-local information.
A Stochastic Unification-Based Grammar is a triple $(U, g, \theta)$,
where $U$ is a unification grammar that defines a set $\Omega$ of parses
as described above, $g = (g_1, \ldots, g_m)$ is a vector of property
functions as just described, and
$\theta = (\theta_1, \ldots, \theta_m)$ is a vector of non-negative
real-valued parameters called property weights. The probability
$P_\theta(\omega)$ of a parse $\omega \in \Omega$ is:
$$P_\theta(\omega) = \frac{W_\theta(\omega)}{Z_\theta}, \quad \text{where:}$$
$$W_\theta(\omega) = \prod_{j=1}^{m} \theta_j^{g_j(\omega)}, \quad \text{and}$$
$$Z_\theta = \sum_{\omega' \in \Omega} W_\theta(\omega')$$
Intuitively, if $g_j(\omega)$ is the number of times that property $j$
occurs in $\omega$ then $\theta_j$ is the ‘weight’ or ‘cost’ of each
occurrence of property $j$ and $Z_\theta$ is a normalising constant that
ensures that the probability of all parses sums to 1.
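As a minimal sketch of these definitions, the following computes parse weights and normalised probabilities from feature-level property values. The feature names, property values, weights and the three-parse set are all hypothetical. Note also an assumption: in the text $Z_\theta$ sums over all of $\Omega$, which is generally infinite, so the sketch normalises over a small finite set of parses instead (as the conditional quantities of the following subsections do).

```python
from math import prod

# Hypothetical feature-level property values g(f) = (g_1(f), g_2(f)).
g = {"f_shared": (0, 0), "f_a": (1, 0), "f_b": (0, 1), "f_c": (1, 0)}
theta = (2.0, 0.5)          # assumed property weights theta_1, theta_2

# A small finite set of parses, each a set of features (assumed).
parses = [
    frozenset({"f_shared", "f_b"}),
    frozenset({"f_shared", "f_b", "f_c"}),
    frozenset({"f_shared", "f_a"}),
]


def properties(parse):
    """g_k(omega): sum of g_k(f) over the features f in the parse, eq. (2)."""
    return [sum(g[f][k] for f in parse) for k in range(len(theta))]


def weight(parse):
    """W(omega): product over j of theta_j raised to g_j(omega)."""
    return prod(t ** gk for t, gk in zip(theta, properties(parse)))


weights = [weight(p) for p in parses]
z = sum(weights)                          # normalising constant
probabilities = [w / z for w in weights]  # P(omega) = W(omega) / Z
```

Because each property value of a parse is a sum over its features, the weight of a parse factorises into one term per feature, which is what the derivations below exploit.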
Now we discuss the calculation of several impor-
tant quantities for SUBGs. In each case we show
that the quantity can be expressed as the value that
maximises a product of functions or else as the sum
of a product of functions, each of which depends
on a small subset of the variables X. These are the
kinds of quantities for which dynamic programming
graphical model algorithms have been developed.
3.1 The most probable parse
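In parsing applications it is important to be able to extract the most
probable (or MAP) parse $\hat\omega(y)$ of string $y$ with respect to a
SUBG. This parse is:
$$\hat\omega(y) = \operatorname*{argmax}_{\omega \in \Omega(y)} W_\theta(\omega)$$
Given a packed representation $(\mathcal{F}', X, N, \alpha)$ for the
parses $\Omega(y)$, let $\hat x(y)$ be the $x$ that identifies
$\hat\omega(y)$. Since $W_\theta(\hat\omega(y)) > 0$, it can be shown
that:
$$\begin{aligned}
\hat x(y) &= \operatorname*{argmax}_{x \in \mathcal{X}} N(x) \prod_{j=1}^{m} \theta_j^{g_j(\omega(x))} \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} N(x) \prod_{j=1}^{m} \theta_j^{\sum_{f \in \omega(x)} g_j(f)} \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} N(x) \prod_{j=1}^{m} \theta_j^{\sum_{f \in \mathcal{F}'} \alpha_f(x) g_j(f)} \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} N(x) \prod_{j=1}^{m} \prod_{f \in \mathcal{F}'} \theta_j^{\alpha_f(x) g_j(f)} \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} N(x) \prod_{f \in \mathcal{F}'} \Bigl( \prod_{j=1}^{m} \theta_j^{g_j(f)} \Bigr)^{\alpha_f(x)} \\
&= \operatorname*{argmax}_{x \in \mathcal{X}} \prod_{\eta \in N} \eta(x) \prod_{f \in \mathcal{F}'} h_{\theta,f}(x) \qquad (3)
\end{aligned}$$
where $h_{\theta,f}(x) = \prod_{j=1}^{m} \theta_j^{g_j(f)}$ if
$\alpha_f(x) = 1$ and $h_{\theta,f}(x) = 1$ if $\alpha_f(x) = 0$. Note
that $h_{\theta,f}(x)$ depends on exactly the same variables in $X$ as
$\alpha_f$ does. As (3) makes clear, finding $\hat x(y)$ involves
maximising a product of functions where each function depends on a
subset of the variables $X$. As explained below, this is exactly the
kind of maximisation that can be solved using graphical model
techniques.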
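The factored objective in (3) can be checked on a small case. The sketch below uses a hypothetical toy representation and invented weights, and maximises the product of no-goods and $h_{\theta,f}$ factors by brute force over assignments. Brute force is used only to make the objective concrete; it is not the dynamic programming method this paper describes.

```python
from itertools import product
from math import prod

# Hypothetical toy packed representation and weights (not from the paper).
domains = [(0, 1), (0, 1)]                      # ranges of X1 and X2
no_goods = [lambda x: int(not (x[0] == 1 and x[1] == 1))]
alpha = {
    "f_shared": lambda x: 1,
    "f_a": lambda x: int(x[0] == 1),
    "f_b": lambda x: int(x[0] == 0),
    "f_c": lambda x: int(x[1] == 1),
}
g = {"f_shared": (0, 0), "f_a": (1, 0), "f_b": (0, 1), "f_c": (1, 0)}
theta = (2.0, 0.5)


def h(f, x):
    """h_{theta,f}(x): the feature's weight if its condition holds, else 1."""
    if alpha[f](x) == 0:
        return 1.0
    return prod(t ** gj for t, gj in zip(theta, g[f]))


def objective(x):
    """Right-hand side of (3): product of no-goods times product of h factors."""
    return prod(eta(x) for eta in no_goods) * prod(h(f, x) for f in alpha)


# Brute-force maximisation over all assignments (illustration only).
x_hat = max(product(*domains), key=objective)
map_parse = {f for f in alpha if alpha[f](x_hat) == 1}
```

Each factor reads only the variables its condition mentions, so the same objective is a candidate for the variable-by-variable maximisation of section 4.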
3.2 Conditional likelihood
We now turn to the estimation of the property weights $\theta$ from a
training corpus of parsed data $D = (\omega_1, \ldots, \omega_n)$. As
explained in Johnson et al. (1999), one way to do this is to find the
$\theta$ that maximises the conditional likelihood of the training
corpus parses given their yields. (Johnson et al. actually maximise
conditional likelihood regularized with a Gaussian prior, but for
simplicity we ignore this here.) If $y_i$ is the yield of the parse
$\omega_i$, the conditional likelihood of the parses given their yields
is:
$$L_D(\theta) = \prod_{i=1}^{n} \frac{W_\theta(\omega_i)}{Z_\theta(\Omega(y_i))}$$
where $\Omega(y)$ is the set of parses with yield $y$ and:
$$Z_\theta(S) = \sum_{\omega \in S} W_\theta(\omega).$$
Then the maximum conditional likelihood estimate $\hat\theta$ of
$\theta$ is $\hat\theta = \operatorname*{argmax}_\theta L_D(\theta)$.
Now calculating $W_\theta(\omega_i)$ poses no computational problems,
but since $\Omega(y_i)$ (the set of parses for $y_i$) can be large,
calculating $Z_\theta(\Omega(y_i))$ by enumerating each
$\omega \in \Omega(y_i)$ can be computationally expensive. However,
there is an alternative method for calculating $Z_\theta(\Omega(y_i))$
that does not involve this enumeration. As noted above, for each yield
$y_i, i = 1, \ldots, n$, Maxwell and Kaplan’s parsing algorithm returns a
packed feature structure $R_i$ that represents the parses of $y_i$,
i.e., $\Omega(y_i) = \Omega(R_i)$. A derivation parallel to the one for
(3) shows that for $R = (\mathcal{F}', X, N, \alpha)$:
$$Z_\theta(\Omega(R)) = \sum_{x \in \mathcal{X}} \prod_{\eta \in N} \eta(x) \prod_{f \in \mathcal{F}'} h_{\theta,f}(x) \qquad (4)$$
(This derivation relies on the isomorphism between parses and variable
assignments in (1).) It turns out that this type of sum can also be
calculated using graphical model techniques.
3.3 Conditional Expectations
In general, iterative numerical procedures are required to find the
property weights $\theta$ that maximise the conditional likelihood
$L_D(\theta)$. While there are a number of different techniques that can
be used, all of the efficient techniques require the calculation of
conditional expectations $E_\theta[g_k \mid y_i]$ for each property
$g_k$ and each sentence $y_i$ in the training corpus, where:
$$E_\theta[g \mid y] = \sum_{\omega \in \Omega(y)} g(\omega) P_\theta(\omega \mid y) = \frac{\sum_{\omega \in \Omega(y)} g(\omega) W_\theta(\omega)}{Z_\theta(\Omega(y))}$$
For example, the Conjugate Gradient algorithm, which was used by Johnson
et al., requires the calculation not just of $L_D(\theta)$ but also its
derivatives $\partial L_D(\theta) / \partial \theta_k$. It is
straightforward to show:
$$\frac{\partial L_D(\theta)}{\partial \theta_k} = \frac{L_D(\theta)}{\theta_k} \sum_{i=1}^{n} \bigl( g_k(\omega_i) - E_\theta[g_k \mid y_i] \bigr).$$
We have just described the calculation of $L_D(\theta)$, so if we can
calculate $E_\theta[g_k \mid y_i]$ then we can calculate the partial
derivatives required by the Conjugate Gradient algorithm as well.
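The derivative formula can be recovered in a few steps by differentiating the logarithm of the conditional likelihood. The sketch below assumes $\theta_k > 0$ and uses only the definitions of $W_\theta$, $Z_\theta$ and $E_\theta$ given above.

```latex
\begin{align*}
\log L_D(\theta)
  &= \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{m} g_j(\omega_i) \log \theta_j
     - \log Z_\theta(\Omega(y_i)) \Bigr) \\
\frac{\partial}{\partial \theta_k} \log Z_\theta(\Omega(y_i))
  &= \frac{1}{Z_\theta(\Omega(y_i))}
     \sum_{\omega \in \Omega(y_i)} \frac{g_k(\omega)}{\theta_k} W_\theta(\omega)
   = \frac{E_\theta[g_k \mid y_i]}{\theta_k} \\
\frac{\partial}{\partial \theta_k} \log L_D(\theta)
  &= \frac{1}{\theta_k} \sum_{i=1}^{n}
     \bigl( g_k(\omega_i) - E_\theta[g_k \mid y_i] \bigr) \\
\frac{\partial L_D(\theta)}{\partial \theta_k}
  &= L_D(\theta) \, \frac{\partial}{\partial \theta_k} \log L_D(\theta)
\end{align*}
```

The second line uses $\partial W_\theta(\omega) / \partial \theta_k = g_k(\omega) W_\theta(\omega) / \theta_k$, which follows from the product form of $W_\theta$.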
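Again, let $R = (\mathcal{F}', X, N, \alpha)$ be a packed representation
such that $\Omega(R) = \Omega(y_i)$. First, note that (2) implies that:
$$E_\theta[g_k \mid y_i] = \sum_{f \in \mathcal{F}'} g_k(f) \, P_\theta(\{\omega : f \in \omega\} \mid y_i).$$
Note that $P_\theta(\{\omega : f \in \omega\} \mid y_i)$ involves the sum
of weights over all $x \in \mathcal{X}$ subject to the conditions that
$N(x) = 1$ and $\alpha_f(x) = 1$. Thus
$P_\theta(\{\omega : f \in \omega\} \mid y_i)$ can also be expressed in a
form that is easy to evaluate using graphical techniques.
$$Z_\theta(\Omega(R)) \, P_\theta(\{\omega : f \in \omega\} \mid y_i) = \sum_{x \in \mathcal{X}} \alpha_f(x) \prod_{\eta \in N} \eta(x) \prod_{f' \in \mathcal{F}'} h_{\theta,f'}(x) \qquad (5)$$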
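A small numerical sketch may help tie (4) and (5) to the expectation formula. It uses a hypothetical toy representation and weights, and evaluates the sums by brute force over assignments purely to check the identities; the point of the paper is that the same sums can be evaluated by dynamic programming instead.

```python
from itertools import product
from math import prod

# Hypothetical toy packed representation and weights (not from the paper).
domains = [(0, 1), (0, 1)]
no_goods = [lambda x: int(not (x[0] == 1 and x[1] == 1))]
alpha = {
    "f_shared": lambda x: 1,
    "f_a": lambda x: int(x[0] == 1),
    "f_b": lambda x: int(x[0] == 0),
    "f_c": lambda x: int(x[1] == 1),
}
g = {"f_shared": (0, 0), "f_a": (1, 0), "f_b": (0, 1), "f_c": (1, 0)}
theta = (2.0, 0.5)


def h(f, x):
    """h_{theta,f}(x): the feature's weight if its condition holds, else 1."""
    if alpha[f](x) == 0:
        return 1.0
    return prod(t ** gj for t, gj in zip(theta, g[f]))


def assignment_weight(x):
    """Summand of (4): zero for assignments that violate a no-good."""
    return prod(eta(x) for eta in no_goods) * prod(h(f, x) for f in alpha)


def partition_function():
    """Z(Omega(R)) as in (4)."""
    return sum(assignment_weight(x) for x in product(*domains))


def feature_marginal(f):
    """P({omega : f in omega} | y), i.e. (5) divided by Z."""
    total = sum(alpha[f](x) * assignment_weight(x) for x in product(*domains))
    return total / partition_function()


def expected_property(k):
    """E[g_k | y] as a sum over features of g_k(f) times its marginal."""
    return sum(g[f][k] * feature_marginal(f) for f in alpha)
```

For this toy case `partition_function()` is 3.5, and `expected_property(0)` agrees with the value obtained by listing the three parses and averaging their first property under the normalised weights.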
4 Graphical model calculations
In this section we briefly review graphical model algorithms for
maximising and summing products of functions of the kind presented
above. It turns out that the algorithm for maximisation is a
generalisation of the Viterbi algorithm for HMMs, and the algorithm for
computing the summation in (5) is a generalisation of the
forward-backward algorithm for HMMs (Smyth et al., 1997). Viewed
abstractly, these algorithms simplify these expressions by moving common
factors over the max or sum operators respectively. These techniques are
now relatively standard; the most well-known approach involves junction
trees (Pearl, 1988; Cowell, 1999). We adopt the approach described by
Geman and Kochanek (2000), which is a straightforward generalization of
HMM dynamic programming with minimal assumptions and programming
overhead. However, in principle any of the graphical model computational
algorithms can be used.
The quantities (3), (4) and (5) involve maximisa-
tion or summation over a product of functions, each
of which depends only on the values of a subset of
the variables X. There are dynamic programming
algorithms for calculating all of these quantities, but
for reasons of space we only describe an algorithm
for finding the maximum value of a product of functions.
These graph algorithms are rather involved.
It may be easier to follow if one reads Example 1
before or in parallel with the definitions below.
To explain the algorithm we use the following notation. If $x$ and $x'$
are both vectors of length $m$ then $x =_j x'$ iff $x$ and $x'$ disagree
on at most their $j$th components, i.e., $x_k = x'_k$ for
$k = 1, \ldots, j-1, j+1, \ldots, m$. If $f$ is a function whose domain
is $\mathcal{X}$, we say that $f$ depends on the set of variables
$d(f) = \{X_j \mid \exists x, x' \in \mathcal{X}, x =_j x', f(x) \neq f(x')\}$.
That is, $X_j \in d(f)$ iff changing the value of $X_j$ can change the
value of $f$.
The algorithm relies on the fact that the variables in
$X = (X_1, \ldots, X_n)$ are ordered (e.g., $X_1$ precedes $X_2$, etc.),
and while the algorithm is correct for any variable ordering, its
efficiency may vary dramatically depending on the ordering as described
below. Let $\mathcal{H}$ be any set of functions whose domains are
$\mathcal{X}$. We partition $\mathcal{H}$ into disjoint subsets
$\mathcal{H}_1, \ldots, \mathcal{H}_{n+1}$, where $\mathcal{H}_j$ is the
subset of $\mathcal{H}$ that depend on $X_j$ but do not depend on any
variables ordered before $X_j$, and $\mathcal{H}_{n+1}$ is the subset of
$\mathcal{H}$ that do not depend on any variables at all (i.e., they are
constants).[3] That is,
$\mathcal{H}_j = \{H \in \mathcal{H} \mid X_j \in d(H), \forall i < j \; X_i \notin d(H)\}$
and $\mathcal{H}_{n+1} = \{H \in \mathcal{H} \mid d(H) = \emptyset\}$.
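The two definitions above, the dependency set $d(f)$ and the partition by first variable, can be sketched directly. The code below is an illustrative assumption rather than part of the algorithm as published: it tests dependence by exhaustively perturbing one coordinate at a time, uses 0-based variable indices, and the three example functions are invented.

```python
from itertools import product


def depends_on(func, domains):
    """d(f): indices j such that changing x_j alone can change func(x)."""
    deps = set()
    for x in product(*domains):
        for j, dom in enumerate(domains):
            for v in dom:
                x_changed = x[:j] + (v,) + x[j + 1:]   # x' with x' =_j x
                if func(x_changed) != func(x):
                    deps.add(j)
    return deps


def partition_by_first_variable(funcs, domains):
    """Group function names by the earliest variable each depends on.

    groups[j] corresponds to H_{j+1} in the text (0-based indices);
    the last group holds the constants.
    """
    n = len(domains)
    groups = [[] for _ in range(n + 1)]
    for name, func in funcs.items():
        deps = depends_on(func, domains)
        groups[min(deps) if deps else n].append(name)
    return groups


# Hypothetical functions over three binary variables.
domains = [(0, 1), (0, 1), (0, 1)]
funcs = {
    "p": lambda x: x[0] + x[2],
    "q": lambda x: x[1] * x[2],
    "c": lambda x: 7,
}
groups = partition_by_first_variable(funcs, domains)
```

In practice $d(f)$ would be read off the conditions produced by the parser rather than discovered by perturbation; the exhaustive test is only there to mirror the definition.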
As explained in section 3.1, there is a set of functions $\mathcal{A}$
such that the quantities we need to calculate have the general form:
$$M_{\max} = \max_{x \in \mathcal{X}} \prod_{A \in \mathcal{A}} A(x) \qquad (6)$$
$$\hat x = \operatorname*{argmax}_{x \in \mathcal{X}} \prod_{A \in \mathcal{A}} A(x). \qquad (7)$$
$M_{\max}$ is the maximum value of the product expression while $\hat x$
is the value of the variables at which the maximum occurs. In a SUBG
parsing application $\hat x$ identifies the MAP parse.
[3] Strictly speaking this does not necessarily define a partition, as
some of the subsets $\mathcal{H}_j$ may be empty.
The procedure depends on two sequences of functions
$M_i, i = 1, \ldots, n+1$ and $V_i, i = 1, \ldots, n$. Informally, $M_i$
is the maximum value attained by the subset of the functions
$\mathcal{A}$ that depend on one of the variables $X_1, \ldots, X_i$,
and $V_i$ gives information about the value of $X_i$ at which this
maximum is attained.
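To simplify notation we write these functions as functions of the entire
set of variables $X$, but they usually depend on a much smaller set of
variables. The $M_i$ are real valued, while each $V_i$ ranges over
$\mathcal{X}_i$. Let $\mathcal{M} = \{M_1, \ldots, M_n\}$. Recall that
the sets of functions $\mathcal{A}$ and $\mathcal{M}$ can both be
partitioned into disjoint subsets
$\mathcal{A}_1, \ldots, \mathcal{A}_{n+1}$ and
$\mathcal{M}_1, \ldots, \mathcal{M}_{n+1}$ respectively on the basis of
the variables each $A_i$ and $M_i$ depend on. The definition of the
$M_i$ and $V_i, i = 1, \ldots, n$ is as follows:
$$M_i(x) = \max_{x' \in \mathcal{X} \text{ s.t. } x' =_i x} \; \prod_{A \in \mathcal{A}_i} A(x') \prod_{M \in \mathcal{M}_i} M(x') \qquad (8)$$
$$V_i(x) = \operatorname*{argmax}_{x' \in \mathcal{X} \text{ s.t. } x' =_i x} \; \prod_{A \in \mathcal{A}_i} A(x') \prod_{M \in \mathcal{M}_i} M(x')$$
$M_{n+1}$ receives a special definition, since there is no variable
$X_{n+1}$.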
$$M_{n+1} = \Bigl( \prod_{A \in \mathcal{A}_{n+1}} A \Bigr) \Bigl( \prod_{M \in \mathcal{M}_{n+1}} M \Bigr) \qquad (9)$$
The definition of $M_i$ in (8) may look circular (since $M$ appears in
the right-hand side), but in fact it is not. First, note that $M_i$
depends only on variables ordered after $X_i$, so if
$M_j \in \mathcal{M}_i$ then $j < i$. More specifically,
$$d(M_i) = \Bigl( \bigcup_{A \in \mathcal{A}_i} d(A) \cup \bigcup_{M \in \mathcal{M}_i} d(M) \Bigr) \setminus \{X_i\}.$$
Thus we can compute the $M_i$ in the order $M_1, \ldots, M_{n+1}$,
inserting $M_i$ into the appropriate set $\mathcal{M}_k$, where $k > i$,
when $M_i$ is computed.
We claim that $M_{\max} = M_{n+1}$. (Note that $M_{n+1}$ and $M_n$ are
constants, since there are no variables ordered after $X_n$.) To see
this, consider the tree $T$ whose nodes are the $M_i$, and which has a
directed edge from $M_i$ to $M_j$ iff $M_i \in \mathcal{M}_j$ (i.e.,
$M_i$ appears in the right hand side of the definition (8) of $M_j$).
$T$ has a unique root $M_{n+1}$, so there is a path from every $M_i$ to
$M_{n+1}$. Let $i \preceq j$ iff there is a path from $M_i$ to $M_j$ in
this tree. Then a simple induction shows that $M_j$ is a function from
$d(M_j)$ to a maximisation over each of the variables $X_i$ where
$i \preceq j$ of $\prod_{i \preceq j, A \in \mathcal{A}_i} A$.
Further, it is straightforward to show that $V_i(\hat x) = \hat x_i$
(the value $\hat x$ assigns to $X_i$). By the same arguments as above,
$d(V_i)$ only contains variables ordered after $X_i$, so
$V_n = \hat x_n$. Thus we can evaluate the $V_i$ in the order
$V_n, \ldots, V_1$ to find the maximising assignment $\hat x$.
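Example 1 Let $X = \{X_1, X_2, X_3, X_4, X_5, X_6, X_7\}$ and set
$\mathcal{A} = \{a(X_1, X_3), b(X_2, X_4), c(X_3, X_4, X_5), d(X_4, X_5), e(X_6, X_7)\}$.
We can represent the sharing of variables in $\mathcal{A}$ by means of
an undirected graph $G_\mathcal{A}$, where the nodes of $G_\mathcal{A}$
are the variables $X$ and there is an edge in $G_\mathcal{A}$ connecting
$X_i$ to $X_j$ iff $\exists A \in \mathcal{A}$ such that both
$X_i, X_j \in d(A)$. $G_\mathcal{A}$ is depicted below.
[Figure: the undirected graph $G_\mathcal{A}$ on the nodes $X_1, \ldots, X_7$. By the definition above its edges are X1-X3, X2-X4, X3-X4, X3-X5, X4-X5 and X6-X7.]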
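Starting with the variable $X_1$, we compute $M_1$ and $V_1$:
$$M_1(x_3) = \max_{x_1 \in \mathcal{X}_1} a(x_1, x_3)$$
$$V_1(x_3) = \operatorname*{argmax}_{x_1 \in \mathcal{X}_1} a(x_1, x_3)$$
We now proceed to the variable $X_2$.
$$M_2(x_4) = \max_{x_2 \in \mathcal{X}_2} b(x_2, x_4)$$
$$V_2(x_4) = \operatorname*{argmax}_{x_2 \in \mathcal{X}_2} b(x_2, x_4)$$
Since $M_1$ belongs to $\mathcal{M}_3$, it appears in the definition of
$M_3$.
$$M_3(x_4, x_5) = \max_{x_3 \in \mathcal{X}_3} c(x_3, x_4, x_5) \, M_1(x_3)$$
$$V_3(x_4, x_5) = \operatorname*{argmax}_{x_3 \in \mathcal{X}_3} c(x_3, x_4, x_5) \, M_1(x_3)$$
Similarly, $M_4$ is defined in terms of $M_2$ and $M_3$.
$$M_4(x_5) = \max_{x_4 \in \mathcal{X}_4} d(x_4, x_5) \, M_2(x_4) \, M_3(x_4, x_5)$$
$$V_4(x_5) = \operatorname*{argmax}_{x_4 \in \mathcal{X}_4} d(x_4, x_5) \, M_2(x_4) \, M_3(x_4, x_5)$$
Note that $M_5$ is a constant, reflecting the fact that in
$G_\mathcal{A}$ the node $X_5$ is not connected to any node ordered
after it.
$$M_5 = \max_{x_5 \in \mathcal{X}_5} M_4(x_5)$$
$$V_5 = \operatorname*{argmax}_{x_5 \in \mathcal{X}_5} M_4(x_5)$$
The second component is defined in the same way:
$$M_6(x_7) = \max_{x_6 \in \mathcal{X}_6} e(x_6, x_7)$$
$$V_6(x_7) = \operatorname*{argmax}_{x_6 \in \mathcal{X}_6} e(x_6, x_7)$$
$$M_7 = \max_{x_7 \in \mathcal{X}_7} M_6(x_7)$$
$$V_7 = \operatorname*{argmax}_{x_7 \in \mathcal{X}_7} M_6(x_7)$$
The maximum value for the product $M_8 = M_{\max}$ is defined in terms of
$M_5$ and $M_7$.
$$M_{\max} = M_8 = M_5 M_7$$
Finally, we evaluate $V_7, \ldots, V_1$ to find the maximising
assignment $\hat x$.
$$\hat x_7 = V_7$$
$$\hat x_6 = V_6(\hat x_7)$$
$$\hat x_5 = V_5$$
$$\hat x_4 = V_4(\hat x_5)$$
$$\hat x_3 = V_3(\hat x_4, \hat x_5)$$
$$\hat x_2 = V_2(\hat x_4)$$
$$\hat x_1 = V_1(\hat x_3)$$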
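The procedure of equations (8) and (9) and the back-substitution through the $V_i$ can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: functions are given as (scope, callable) pairs with 0-based variable indices, all function values are assumed non-negative, the $M_i$ and $V_i$ are stored as lookup tables, and the numeric definitions of a, b, c, d and e are invented so that Example 1 has concrete values.

```python
from itertools import product


def max_product(factors, domains):
    """Maximise a product of non-negative functions, one variable at a time.

    factors: list of (scope, fn); scope is a tuple of variable indices and
    fn maps a dict {index: value} covering its scope to a number.
    Returns (M_max, x_hat).
    """
    n = len(domains)
    # buckets[i]: functions whose first variable is i; buckets[n]: constants.
    buckets = [[] for _ in range(n + 1)]
    for scope, fn in factors:
        buckets[min(scope) if scope else n].append((frozenset(scope), fn))
    V = [None] * n
    for i in range(n):
        funcs = buckets[i]
        # d(M_i): every variable the bucket's functions mention, except X_i.
        scope = sorted(set().union(*(s for s, _ in funcs)) - {i})
        m_table, v_table = {}, {}
        for values in product(*(domains[v] for v in scope)):
            x = dict(zip(scope, values))
            best_val, best_xi = None, None
            for xi in domains[i]:
                x[i] = xi
                val = 1.0
                for _, fn in funcs:
                    val *= fn(x)
                if best_val is None or val > best_val:
                    best_val, best_xi = val, xi
            m_table[values] = best_val
            v_table[values] = best_xi
        V[i] = (scope, v_table)
        m_fn = lambda x, s=tuple(scope), t=m_table: t[tuple(x[v] for v in s)]
        # Insert M_i into the bucket of the first variable it depends on.
        buckets[scope[0] if scope else n].append((frozenset(scope), m_fn))
    m_max = 1.0
    for _, fn in buckets[n]:          # equation (9): product of constants
        m_max *= fn({})
    x_hat = [None] * n
    for i in reversed(range(n)):      # evaluate V_n, ..., V_1
        scope, table = V[i]
        x_hat[i] = table[tuple(x_hat[v] for v in scope)]
    return m_max, tuple(x_hat)


# Example 1 with hypothetical numeric functions over binary variables.
domains = [(0, 1)] * 7
factors = [
    ((0, 2), lambda x: 1.0 + x[0] + 2.0 * x[0] * x[2]),       # a(X1, X3)
    ((1, 3), lambda x: 2.0 - x[1] + 3.0 * x[1] * x[3]),       # b(X2, X4)
    ((2, 3, 4), lambda x: 1.0 + x[2] * x[3] + 0.5 * x[4]),    # c(X3, X4, X5)
    ((3, 4), lambda x: 3.0 - 2.0 * x[3] * x[4]),              # d(X4, X5)
    ((5, 6), lambda x: 1.0 + x[5] + 4.0 * (1 - x[5]) * x[6]), # e(X6, X7)
]
m_max, x_hat = max_product(factors, domains)
```

Replacing the inner maximisation over `domains[i]` by a sum gives the corresponding sum-of-products computation for quantities such as (4), although the back-substitution step then no longer applies.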
We now briefly consider the computational complexity of this process.
Clearly, the number of steps required to compute each $M_i$ is a
polynomial of order $|d(M_i)| + 1$, since we need to enumerate all
possible values for the argument variables $d(M_i)$ and for each of
these, maximise over the set $\mathcal{X}_i$. Further, it is easy to show
that in terms of the graph $G_\mathcal{A}$, $d(M_j)$ consists of those
variables $X_k, k > j$ reachable by a path starting at $X_j$ and all of
whose nodes except the last are variables that precede $X_j$.
Since computational effort is bounded above by a polynomial of order
$|d(M_i)| + 1$, we seek a variable ordering that bounds the maximum value
of $|d(M_i)|$. Unfortunately, finding the ordering that minimises the
maximum value of $|d(M_i)|$ is an NP-complete problem. However, there are
several efficient heuristics that are reputed in the graphical models
community to produce good visitation schedules. It may be that they will
perform well in the SUBG parsing applications as well.
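The dependence of the cost on the ordering can be made concrete without running the maximisation itself. The sketch below computes $|d(M_i)|$ symbolically for a given ordering using the recursion for $d(M_i)$ stated earlier; the function name and the string variable names are hypothetical. It does not implement any ordering heuristic, it only evaluates the cost of an ordering that is supplied.

```python
def elimination_widths(scopes, order):
    """Return {variable: |d(M_i)|} for processing variables in `order`.

    scopes: iterable of sets, one d(A) per function A.
    Variables already processed never reappear in a pending scope, so a
    pending scope contains v exactly when v is its first variable.
    """
    pending = [set(s) for s in scopes]
    widths = {}
    for v in order:
        here = [s for s in pending if v in s]
        pending = [s for s in pending if v not in s]
        d_m = set().union(*here) - {v}     # d(M_v)
        widths[v] = len(d_m)
        pending.append(d_m)                # M_v joins a later bucket
    return widths


# Scopes of the functions a, b, c, d, e from Example 1.
scopes = [
    {"X1", "X3"}, {"X2", "X4"}, {"X3", "X4", "X5"},
    {"X4", "X5"}, {"X6", "X7"},
]
natural = elimination_widths(
    scopes, ["X1", "X2", "X3", "X4", "X5", "X6", "X7"])
x4_first = elimination_widths(
    scopes, ["X4", "X1", "X2", "X3", "X5", "X6", "X7"])
```

With the ordering used in Example 1 the largest $|d(M_i)|$ is 2 (for $M_3$), whereas processing $X_4$ first already produces a function of three variables.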
5 Conclusion
This paper shows how to apply dynamic programming methods developed for
graphical models to SUBGs to find the most probable parse and to obtain
the statistics needed for estimation directly from Maxwell and Kaplan
packed parse representations, i.e., without expanding these into
individual parses. The algorithm rests on the observation that so long
as features are local to the parse fragments used in the packed
representations, the statistics required for parsing and estimation are
the kinds of quantities that dynamic programming algorithms for
graphical models can compute. Since neither Maxwell and Kaplan’s packed
parsing algorithm nor the procedures described here depend on the
details of the underlying linguistic theory, the approach should apply
to virtually any kind of underlying grammar.
Obviously, an empirical evaluation of the algo-
rithms described here would be extremely useful.
The algorithms described here are exact, but because
we are working with unification grammars
and apparently arbitrary graphical models we can-
not polynomially bound their computational com-
plexity. However, it seems reasonable to expect
that if the linguistic dependencies in a sentence typ-
ically factorize into largely non-interacting cliques
then the dynamic programming methods may offer
dramatic computational savings compared to current
methods that enumerate all possible parses.
It might be interesting to compare these dynamic programming algorithms
with a standard unification-based parser using a best-first search
heuristic. (To our knowledge such an approach has not yet been explored,
but it seems straightforward: the figure of merit could simply be the
sum of the weights of the properties of each partial parse’s fragments.)
Because such parsers prune the search space they cannot guarantee
correct results, unlike the algorithms proposed here. Such a best-first
parser might be accurate when parsing with a trained grammar, but its
results may be poor at the beginning of parameter weight estimation when
the parameter weight estimates are themselves inaccurate.
Finally, it would be extremely interesting to compare these dynamic
programming algorithms to the ones described by Miyao and Tsujii (2002).
It seems that the Maxwell and Kaplan packed representation may permit
more compact representations than the disjunctive representations used
by Miyao and Tsujii, but this does not imply that the algorithms
proposed here are more efficient. Further theoretical and empirical
investigation is required.
References
Steven Abney. 1997. Stochastic Attribute-Value Grammars.
Computational Linguistics, 23(4):597–617.
Robert Cowell. 1999. Introduction to inference for Bayesian
networks. In Michael Jordan, editor, Learning in Graphical
Models, pages 9–26. The MIT Press, Cambridge, Massachusetts.
Kenneth D. Forbus and Johan de Kleer. 1993. Building problem
solvers. The MIT Press, Cambridge, Massachusetts.
Stuart Geman and Kevin Kochanek. 2000. Dynamic program-
ming and the representation of soft-decodable codes. Tech-
nical report, Division of Applied Mathematics, Brown Uni-
versity.
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and
Stefan Riezler. 1999. Estimators for stochastic “unification-based”
grammars. In The Proceedings of the 37th Annual
Conference of the Association for Computational Linguistics,
pages 535–541, San Francisco. Morgan Kaufmann.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001.
Conditional Random Fields: Probabilistic models for seg-
menting and labeling sequence data. In Machine Learn-
ing: Proceedings of the Eighteenth International Conference
(ICML 2001), Stanford, California.
John T. Maxwell III and Ronald M. Kaplan. 1995. A method
for disjunctive constraint satisfaction. In Mary Dalrymple,
Ronald M. Kaplan, John T. Maxwell III, and Annie Zae-
nen, editors, Formal Issues in Lexical-Functional Grammar,
number 47 in CSLI Lecture Notes Series, chapter 14, pages
381–481. CSLI Publications.
Yusuke Miyao and Jun’ichi Tsujii. 2002. Maximum entropy
estimation for feature forests. In Proceedings of Human
Language Technology Conference 2002, March.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference. Morgan Kaufmann,
San Mateo, California.
Padhraic Smyth, David Heckerman, and Michael Jordan. 1997.
Probabilistic Independence Networks for Hidden Markov
Models. Neural Computation, 9(2):227–269.
