Exploiting auxiliary distributions in stochastic unification-based 
grammars 
Mark Johnson* 
Cognitive and Linguistic Sciences 
Brown University 
Mark_Johnson@Brown.edu 
Stefan Riezler 
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
riezler@ims.uni-stuttgart.de
Abstract 
This paper describes a method for estimat- 
ing conditional probability distributions over 
the parses of "unification-based" grammars 
which can utilize auxiliary distributions that 
are estimated by other means. We show how 
this can be used to incorporate information 
about lexical selectional preferences gathered 
from other sources into Stochastic "Unification- 
based" Grammars (SUBGs). While we ap- 
ply this estimator to a Stochastic Lexical- 
Functional Grammar, the method is general, 
and should be applicable to stochastic versions 
of HPSGs, categorial grammars and transfor- 
mational grammars. 
1 Introduction 
"Unification-based" Grammars (UBGs) can 
capture a wide variety of linguistically impor- 
tant syntactic and semantic constraints. How- 
ever, because these constraints can be non-local 
or context-sensitive, developing stochastic ver- 
sions of UBGs and associated estimation pro- 
cedures is not as straight-forward as it is for, 
e.g., PCFGs. Recent work has shown how to 
define probability distributions over the parses 
of UBGs (Abney, 1997) and efficiently estimate 
and use conditional probabilities for parsing 
(Johnson et al., 1999). Like most other practical 
stochastic grammar estimation procedures, this 
latter estimation procedure requires a parsed 
training corpus. 
Unfortunately, large parsed UBG corpora are 
not yet available. This restricts the kinds of 
models one can realistically expect to be able 
to estimate. For example, a model incorporating
lexical selectional preferences of the kind
described below might have tens or hundreds
of thousands of parameters, which one could
not reasonably attempt to estimate from a corpus
with on the order of a thousand clauses.
* This research was supported by NSF awards 9720368,
9870676 and 9812169.
However, statistical models of lexical selec- 
tional preferences can be estimated from very 
large corpora based on simpler syntactic struc- 
tures, e.g., those produced by a shallow parser. 
While there is undoubtedly disagreement be- 
tween these simple syntactic structures and the 
syntactic structures produced by the UBG, one 
might hope that they are close enough for lexical 
information gathered from the simpler syntactic 
structures to be of use in defining a probability 
distribution over the UBG's structures. 
In the estimation procedure described here, 
we call the probability distribution estimated 
from the larger, simpler corpus an auxiliary dis- 
tribution. Our treatment of auxiliary distribu- 
tions is inspired by the treatment of reference 
distributions in Jelinek's (1997) presentation of 
Maximum Entropy estimation, but in our es- 
timation procedure we simply regard the loga- 
rithm of each auxiliary distribution as another 
(real-valued) feature. Despite its simplicity, our 
approach seems to offer several advantages over 
the reference distribution approach. First, it 
is straight-forward to utilize several auxiliary 
distributions simultaneously: each is treated as 
a distinct feature. Second, each auxiliary dis- 
tribution is associated with a parameter which 
scales its contribution to the final distribution. 
In applications such as ours where the auxiliary 
distribution may be of questionable relevance 
to the distribution we are trying to estimate, it 
seems reasonable to permit the estimation pro- 
cedure to discount or even ignore the auxiliary 
distribution. Finally, note that neither Jelinek's 
nor our estimation procedures require that an 
auxiliary or reference distribution $Q$ be a probability
distribution; i.e., it is not necessary that
$Q(\Omega) = 1$, where $\Omega$ is the set of well-formed
linguistic structures.
The rest of this paper is structured as fol- 
lows. Section 2 reviews how exponential mod- 
els can be defined over the parses of UBGs, 
gives a brief description of Stochastic Lexical-Functional
Grammar, and reviews why maximum pseudo-likelihood
estimation is both feasible and sufficient for parsing
purposes. Section 3
presents our new estimator, and shows how it 
is related to the minimization of the Kullback- 
Leibler divergence between the conditional es- 
timated and auxiliary distributions. Section 4 
describes the auxiliary distribution used in our 
experiments, and section 5 presents the results 
of those experiments. 
2 Stochastic Unification-based 
Grammars 
Most of the classes of probabilistic
models used in computational linguistics are
exponential families. That is, the probability $P(\omega)$
of a well-formed syntactic structure $\omega \in \Omega$ is
defined by a function of the form
$$P_\lambda(\omega) = \frac{Q(\omega)}{Z_\lambda}\, e^{\lambda \cdot f(\omega)} \qquad (1)$$
where $f(\omega) \in \mathbb{R}^m$ is a vector of feature values,
$\lambda \in \mathbb{R}^m$ is a vector of adjustable feature parameters,
$Q$ is a function of $\omega$ (which Jelinek (1997)
calls a reference distribution when it is not an
indicator function), and
$Z_\lambda = \int_\Omega Q(\omega)\, e^{\lambda \cdot f(\omega)}\, d\omega$ is
a normalization factor called the partition function.
(Note that a feature here is just a real-valued
function of a syntactic structure $\omega$; to
avoid confusion we use the term "attribute" to
refer to a feature in a feature structure.) If
$Q(\omega) = 1$ then the class of exponential distributions
is precisely the class of distributions
with maximum entropy satisfying the constraint
that the expected value of each feature is a certain
specified value (e.g., a value estimated from
training data), so exponential models are sometimes
also called "Maximum Entropy" models.
For example, the class of distributions obtained
by varying the parameters of a PCFG
is an exponential family. In a PCFG each rule
or production is associated with a feature, so $m$
is the number of rules and the $j$th feature value
$f_j(\omega)$ is the number of times the $j$th rule is used
in the derivation of the tree $\omega \in \Omega$. Simple
manipulations show that $P_\lambda(\omega)$ is equivalent to the
PCFG distribution if $\lambda_j = \log p_j$, where $p_j$ is the
rule emission probability, and $Q(\omega) = Z_\lambda = 1$.
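The equivalence between a PCFG and an exponential model with rule-count features can be checked numerically. The following Python fragment is only an illustrative sketch: the toy grammar, its probabilities, and all function names are hypothetical and do not come from the experiments reported here.

```python
import math
from collections import Counter

# Hypothetical toy PCFG: each rule maps to its emission probability p_j.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}
RULES = list(RULE_PROBS)
# One parameter per rule: lambda_j = log p_j.
LAMBDA = [math.log(RULE_PROBS[rule]) for rule in RULES]


def features(derivation):
    """f_j(omega): the number of times rule j is used in the derivation."""
    counts = Counter(derivation)
    return [counts[rule] for rule in RULES]


def pcfg_prob(derivation):
    """Usual PCFG probability: the product of the probabilities of the rules used."""
    prob = 1.0
    for rule in derivation:
        prob *= RULE_PROBS[rule]
    return prob


def exponential_prob(derivation):
    """Exponential-family form with Q(omega) = Z_lambda = 1."""
    f = features(derivation)
    return math.exp(sum(l * x for l, x in zip(LAMBDA, f)))


# A derivation of "she eats fish", written as the list of rules it uses.
TREE = [
    ("S", ("NP", "VP")),
    ("NP", ("she",)),
    ("VP", ("V", "NP")),
    ("V", ("eats",)),
    ("NP", ("fish",)),
]
```

Under these assumptions `pcfg_prob(TREE)` and `exponential_prob(TREE)` agree up to floating-point rounding, because exponentiating a sum of log probabilities weighted by rule counts is the same as multiplying the rule probabilities.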
If the features satisfy suitable Markovian in- 
dependence constraints, estimation from fully 
observed training data is straight-forward. For 
example, because the rule features of a PCFG 
meet "context-free" Markovian independence 
conditions, the well-known "relative frequency" 
estimator for PCFGs both maximizes the likeli- 
hood of the training data (and hence is asymp- 
totically consistent and efficient) and minimizes 
the Kullback-Leibler divergence between train- 
ing and estimated distributions. 
However, the situation changes dramatically 
if we enforce non-local or context-sensitive con- 
straints on linguistic structures of the kind that 
can be expressed by a UBG. As Abney (1997) 
showed, under these circumstances the relative 
frequency estimator is in general inconsistent, 
even if one restricts attention to rule features. 
Consequently, maximum likelihood estimation 
is much more complicated, as discussed in sec- 
tion 2.2. Moreover, while rule features are natu- 
ral for PCFGs given their context-free indepen- 
dence properties, there is no particular reason 
to use only rule features in Stochastic UBGs 
(SUBGs). Thus an SUBG is a triple $(G, f, \lambda)$,
where $G$ is a UBG which generates a set of well-formed
linguistic structures $\Omega$, and $f$ and $\lambda$ are
vectors of feature functions and feature parameters
as above. The probability of a structure
$\omega \in \Omega$ is given by (1) with $Q(\omega) = 1$. Given a
base UBG, there are usually infinitely many different
ways of selecting the features $f$ to make
an SUBG, and each of these makes an empirical
claim about the class of possible distributions
of structures.
2.1 Stochastic Lexical Functional 
Grammar 
Stochastic Lexical-Functional Grammar 
(SLFG) is a stochastic extension of Lexical- 
Functional Grammar (LFG), a UBG formalism 
developed by Kaplan and Bresnan (1982). 
Given a base LFG, an SLFG is constructed 
by defining features which identify salient 
constructions in a linguistic structure (in LFG 
this is a c-structure/f-structure pair and its 
associated mapping; see Kaplan (1995)). Apart 
from the auxiliary distributions, we based our 
features on those used in Johnson et al. (1999), 
which should be consulted for further details. 
Most of these feature values range over the 
natural numbers, counting the number of times 
that a particular construction appears in a 
linguistic structure. For example, adjunct and 
argument features count the number of adjunct 
and argument attachments, permitting SLFG 
to capture a general argument attachment pref- 
erence, while more specialized features count 
the number of attachments to each grammatical 
function (e.g., SUBJ, OBJ, COMP, etc.).
The flexibility of features in stochastic UBGs 
permits us to include features for relatively 
complex constructions, such as date expres- 
sions (it seems that date interpretations, if 
possible, are usually preferred), right-branching 
constituent structures (usually preferred) and 
non-parallel coordinate structures (usually 
dispreferred). Johnson et al. remark that they 
would have liked to have included features for 
lexical selectional preferences. While such fea- 
tures are perfectly acceptable in a SLFG, they 
felt that their corpora were so small that the 
large number of lexical dependency parameters 
could not be accurately estimated. The present 
paper proposes a method to address this by 
using an auxiliary distribution estimated from 
a corpus large enough to (hopefully) provide 
reliable estimates for these parameters. 
2.2 Estimating stochastic 
unification-based grammars 
Suppose $\tilde\omega = \omega_1, \ldots, \omega_n$ is a corpus of $n$ syntactic
structures. Letting $f_j(\tilde\omega) = \sum_{i=1}^{n} f_j(\omega_i)$
and assuming each $\omega_i \in \Omega$, the likelihood of the
corpus $L_\lambda(\tilde\omega)$ is:
$$L_\lambda(\tilde\omega) = \prod_{i=1}^{n} P_\lambda(\omega_i) = e^{\lambda \cdot f(\tilde\omega)}\, Z_\lambda^{-n} \qquad (2)$$
$$\frac{\partial}{\partial \lambda_j} \log L_\lambda(\tilde\omega) = f_j(\tilde\omega) - n E_\lambda(f_j) \qquad (3)$$
where $E_\lambda(f_j)$ is the expected value of $f_j$ under
the distribution $P_\lambda$. The maximum likelihood
estimates are the $\lambda$ which maximize (2), or
equivalently, which make (3) zero, but as Johnson
et al. (1999) explain, there seems to be no
practical way of computing these for realistic
SUBGs since evaluating (2) and its derivatives
(3) involves integrating over all syntactic structures
$\Omega$.
However, Johnson et al. observe that parsing
applications require only the conditional probability
distribution $P_\lambda(\omega|y)$, where $y$ is the terminal
string or yield being parsed, and that this
can be estimated by maximizing the pseudo-likelihood
of the corpus $PL_\lambda(\tilde\omega)$:
$$PL_\lambda(\tilde\omega) = \prod_{i=1}^{n} P_\lambda(\omega_i|y_i) = e^{\lambda \cdot f(\tilde\omega)} \prod_{i=1}^{n} Z_\lambda^{-1}(y_i) \qquad (4)$$
In (4), $y_i$ is the yield of $\omega_i$ and
$$Z_\lambda(y_i) = \int_{\Omega(y_i)} e^{\lambda \cdot f(\omega)}\, d\omega,$$
where $\Omega(y_i)$ is the set of all syntactic structures
in $\Omega$ with yield $y_i$ (i.e., all parses of $y_i$ generated
by the base UBG). It turns out that calculating
the pseudo-likelihood of a corpus only
involves integrations over the sets of parses of
its yields $\Omega(y_i)$, which is feasible for many interesting
UBGs. Moreover, the maximum pseudo-likelihood
estimator is asymptotically consistent
for the conditional distribution $P(\omega|y)$. For the
reasons explained in Johnson et al. (1999) we actually
estimate $\lambda$ by maximizing a regularized
version of the log pseudo-likelihood (5), where
$\sigma_j$ is 7 times the maximum value of $f_j$ found in
the training corpus:
$$\log PL_\lambda(\tilde\omega) - \sum_{j=1}^{m} \frac{\lambda_j^2}{2\sigma_j^2} \qquad (5)$$
See Johnson et al. (1999) for details of the calculation
of this quantity and its derivatives, and
the conjugate gradient routine used to calculate
the $\lambda$ which maximize the regularized log
pseudo-likelihood of the training corpus.
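To make the regularized objective concrete, the following Python sketch computes it and its gradient for a corpus in which each sentence comes with an explicit list of parses. It is not the routine of Johnson et al. (1999). The data layout, the function names and the toy corpus are hypothetical, and taking the maximum of each feature over all parses (rather than over correct parses only) is an assumption.

```python
import math


def dot(lam, f):
    return sum(l * x for l, x in zip(lam, f))


def log_partition(parses, lam):
    """log Z_lambda(y): log of the sum of exp(lambda . f) over the parses of one yield."""
    scores = [dot(lam, f) for f in parses]
    top = max(scores)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scores))


def regularized_log_pl(corpus, lam, sigma):
    """Log pseudo-likelihood minus the penalty sum_j lambda_j^2 / (2 sigma_j^2).

    corpus is a list of (parses, correct_index) pairs, where parses is a
    list of feature vectors, one per parse of the sentence.
    """
    total = 0.0
    for parses, correct in corpus:
        total += dot(lam, parses[correct]) - log_partition(parses, lam)
    penalty = sum(l * l / (2.0 * s * s) for l, s in zip(lam, sigma))
    return total - penalty


def gradient(corpus, lam, sigma):
    """Observed minus conditionally expected feature values, minus the penalty term."""
    grad = [-l / (s * s) for l, s in zip(lam, sigma)]
    for parses, correct in corpus:
        log_z = log_partition(parses, lam)
        for j, x in enumerate(parses[correct]):
            grad[j] += x
        for f in parses:
            p = math.exp(dot(lam, f) - log_z)  # P_lambda(omega | y)
            for j, x in enumerate(f):
                grad[j] -= p * x
    return grad


# Hypothetical corpus: two ambiguous sentences, two features, parse 0 correct.
CORPUS = [
    ([[1.0, 0.0], [0.0, 1.0]], 0),
    ([[2.0, 1.0], [1.0, 1.0], [0.0, 2.0]], 0),
]
# sigma_j = 7 times the largest value of feature j (assumed: over all parses).
SIGMA = [7.0 * max(f[j] for parses, _ in CORPUS for f in parses) for j in range(2)]
```

Any gradient-based optimizer could be applied to these two functions. Note that only the parses of each training yield are enumerated, never the whole set of structures.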
3 Auxiliary distributions 
We modify the estimation problem presented in
section 2.2 by assuming that in addition to the
corpus $\tilde\omega$ and the $m$ feature functions $f$ we are
given $k$ auxiliary distributions $Q_1, \ldots, Q_k$ whose
support includes $\Omega$ that we suspect may be related
to the joint distribution $P(\omega)$ or conditional
distribution $P(\omega|y)$ that we wish to estimate.
We do not require that the $Q_j$ be probability
distributions, i.e., it is not necessary that
$\int_\Omega Q_j(\omega)\, d\omega = 1$, but we do require that they
are strictly positive (i.e., $Q_j(\omega) > 0, \forall \omega \in \Omega$).
We define $k$ new features $f_{m+1}, \ldots, f_{m+k}$ where
$f_{m+j}(\omega) = \log Q_j(\omega)$, which we call auxiliary
features. The $m + k$ parameters associated with
the resulting $m + k$ features can be estimated using
any method for estimating the parameters
of an exponential family with real-valued features
(in our experiments we used the pseudo-likelihood
estimation procedure reviewed in section 2.2).
Such a procedure estimates parameters
$\lambda_{m+1}, \ldots, \lambda_{m+k}$ associated with the auxiliary
features, so the estimated distributions take
the form (6) (for simplicity we only discuss joint
distributions here, but the treatment of conditional
distributions is parallel).
$$P_\lambda(\omega) = \frac{\prod_{j=1}^{k} Q_j(\omega)^{\lambda_{m+j}}}{Z_\lambda}\, e^{\sum_{j=1}^{m} \lambda_j f_j(\omega)} \qquad (6)$$
Note that the auxiliary distributions $Q_j$ are
treated as fixed distributions for the purposes
of this estimation, even though each $Q_j$ may itself
be a complex model obtained via a previous
estimation process. Comparing (6) with (1),
we see that the two equations become
identical if the reference distribution $Q$ in (1) is
replaced by a geometric mixture of the auxiliary
distributions $Q_j$, i.e., if:
$$Q(\omega) = \prod_{j=1}^{k} Q_j(\omega)^{\lambda_{m+j}}.$$
The parameter associated with an auxiliary feature
represents the weight of that feature in the
mixture. If a parameter $\lambda_{m+j} = 1$ then the
corresponding auxiliary distribution $Q_j$ is equivalent
to a reference distribution in Jelinek's sense,
while if $\lambda_{m+j} = 0$ then $Q_j$ is effectively ignored.
Thus our approach can be regarded as
a smoothed version of Jelinek's reference distribution
approach, generalized to permit multiple
auxiliary distributions.
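The two limiting cases can be illustrated on a small finite set of structures. The Python sketch below is only an illustration under assumed values: the set `OMEGA`, the feature table `F` and the auxiliary function `Q1` are hypothetical, with a single ordinary feature and a single auxiliary distribution.

```python
import math

OMEGA = ["a", "b", "c"]                     # hypothetical finite set of structures
F = {"a": [1.0], "b": [0.0], "c": [2.0]}    # m = 1 ordinary feature
Q1 = {"a": 0.5, "b": 2.0, "c": 0.1}         # strictly positive; need not sum to 1


def normalize(weights):
    z = sum(weights.values())
    return {w: weights[w] / z for w in weights}


def with_auxiliary_feature(lam, lam_aux):
    """Exponential model in which log Q1 is treated as feature m + 1."""
    weights = {}
    for w in OMEGA:
        score = sum(l * x for l, x in zip(lam, F[w])) + lam_aux * math.log(Q1[w])
        weights[w] = math.exp(score)
    return normalize(weights)


def with_reference(lam, q):
    """Exponential model in which q plays the role of the reference distribution Q."""
    weights = {}
    for w in OMEGA:
        weights[w] = q[w] * math.exp(sum(l * x for l, x in zip(lam, F[w])))
    return normalize(weights)


LAM = [0.4]
AS_REFERENCE = with_auxiliary_feature(LAM, 1.0)   # weight 1: Q1 acts as a reference distribution
IGNORED = with_auxiliary_feature(LAM, 0.0)        # weight 0: Q1 has no effect
```

In this sketch `AS_REFERENCE` coincides with `with_reference(LAM, Q1)`, and `IGNORED` coincides with the model whose reference function is constantly 1. Intermediate weights interpolate geometrically between the two.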
4 Lexical selectional preferences 
The auxiliary distribution we used here is based 
on the probabilistic model of lexical selectional 
preferences described in Rooth et al. (1999). An 
existing broad-coverage parser was used to find 
shallow parses (compared to the LFG parses) 
for the 117 million word British National Corpus
(Carroll and Rooth, 1998). We based our
auxiliary distribution on 3.7 million $(g, r, a)$ tuples
(belonging to 600,000 types) we extracted
from these parses, where $g$ is a lexical governor (for
the shallow parses, $g$ is either a verb or a preposition),
$a$ is the head of one of its NP arguments
and $r$ is the grammatical relationship between
the governor and argument (in the shallow
parses $r$ is always OBJ for prepositional governors,
and $r$ is either SUBJ or OBJ for verbal
governors).
In order to avoid sparse data problems we
smoothed this distribution over tuples as described
in Rooth et al. (1999). We assume that
governor-relation pairs $(g, r)$ and arguments $a$
are generated independently of each other given
one of 25 hidden classes $C$, i.e.:
$$P((g, r, a)) = \sum_{c \in C} P_e((g, r)|c)\, P_e(a|c)\, P_e(c)$$
where the distributions $P_e$ are estimated from
the training tuples using the Expectation-Maximization
algorithm. While the hidden
classes are not given any prior interpretation
they often cluster semantically coherent predicates
and arguments, as shown in Figure 1.
The smoothing power of a clustering model such 
as this can be calculated explicitly as the per- 
centage of possible tuples which are assigned a 
non-zero probability. For the 25-class model 
we get a smoothing power of 99%, compared 
to only 1.7% using the empirical distribution of 
the training data. 
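The class-based model and the notion of smoothing power can be sketched in a few lines of Python. The two-class model below is purely hypothetical (the model used here has 25 classes estimated with EM), as are the three observed tuples. The point is only to show how summing over hidden classes assigns non-zero probability to unseen tuples.

```python
from itertools import product

# Hypothetical two-class model with hand-picked parameters.
P_CLASS = {0: 0.6, 1: 0.4}
P_PRED = {  # P((g, r) | c)
    0: {("say", "SUBJ"): 0.7, ("ask", "SUBJ"): 0.3, ("feed", "OBJ"): 0.0},
    1: {("say", "SUBJ"): 0.1, ("ask", "SUBJ"): 0.1, ("feed", "OBJ"): 0.8},
}
P_ARG = {  # P(a | c)
    0: {"man": 0.5, "woman": 0.5, "paper": 0.0},
    1: {"man": 0.1, "woman": 0.1, "paper": 0.8},
}
PREDICATES = list(P_PRED[0])
ARGUMENTS = list(P_ARG[0])

# Hypothetical training tuples, used for the empirical distribution.
OBSERVED = [("say", "SUBJ", "man"), ("ask", "SUBJ", "woman"), ("feed", "OBJ", "paper")]


def model_prob(g, r, a):
    """P((g, r, a)) = sum over classes c of P((g, r) | c) P(a | c) P(c)."""
    return sum(P_CLASS[c] * P_PRED[c][(g, r)] * P_ARG[c][a] for c in P_CLASS)


def empirical_prob(g, r, a):
    """Relative frequency of the tuple in the training data."""
    return OBSERVED.count((g, r, a)) / len(OBSERVED)


def smoothing_power(prob):
    """Fraction of all possible tuples that are assigned non-zero probability."""
    tuples = [(g, r, a) for (g, r), a in product(PREDICATES, ARGUMENTS)]
    return sum(1 for t in tuples if prob(*t) > 0) / len(tuples)
```

With these toy numbers the empirical distribution covers 3 of the 9 possible tuples, while the class-based model covers all 9, mirroring on a tiny scale the contrast between 1.7% and 99% reported above.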
5 Empirical evaluation 
Hadar Shemtov and Ron Kaplan at Xerox PARC 
provided us with two LFG parsed corpora called 
the Verbmobil corpus and the Homecentre corpus.
These contain parse forests for each sentence
(packed according to the scheme described in
Maxwell and Kaplan (1995)), together with a
manual annotation as to which parse is cor- 
rect. The Verbmobil corpus contains 540 sen- 
tences relating to appointment planning, while 
the Homecentre corpus contains 980 sentences 
from Xerox documentation on their "homecen- 
tre" multifunction devices. Xerox did not pro- 
vide us with the base LFGs for intellectual prop- 
erty reasons, but from inspection of the parses 
it seems that slightly different grammars were
used with each corpus, so we did not merge the
corpora. We chose the features of our SLFG
solely on the basis of the Verbmobil corpus,
so the Homecentre corpus can be regarded
as a held-out evaluation corpus.
[Figure 1: class matrix for Class 16 (PROB 0.0340); the rows list predicates such as say:s (0.3183), say:o (0.0405) and ask:s (0.0345)]
Figure 1: A depiction of the highest probability predicates and arguments in Class 16. The class
matrix shows at the top the 30 most probable nouns in the $P_e(a|16)$ distribution and their probabilities,
and at the left the 30 most probable verbs and prepositions listed according to $P_e((g, r)|16)$
and their probabilities. Dots in the matrix indicate that the respective pair was seen in the training
data. Predicates with suffix :s indicate the subject slot of an intransitive or transitive verb; the
suffix :o specifies the nouns in the corresponding row as objects of verbs or prepositions.
We discarded the unambiguous sentences in 
each corpus for both training and testing (as 
explained in Johnson et al. (1999), pseudo- 
likelihood estimation ignores unambiguous sen- 
tences), leaving us with a corpus of 324 am- 
biguous sentences in the Verbmobil corpus and 
481 sentences in the Homecentre corpus; these 
sentences had a total of 3,245 and 3,169 parses 
respectively. 
The (non-auxiliary) features used were
based on those described by Johnson et
al. (1999). Different numbers of features
were used with the two corpora because 
some of the features were generated semi- 
automatically (e.g., we introduced a feature for 
every attribute-value pair found in any feature 
structure), and "pseudo-constant" features (i.e., 
features whose values never differ on the parses 
of the same sentence) are discarded. We used 
172 features in the SLFG for the Verbmobil cor- 
pus and 186 features in the SLFG for the Home- 
centre corpus. 
We used three additional auxiliary features
derived from the lexical selectional preference
model described in section 4. These were defined
in the following way. For each governing
predicate $g$, grammatical relation $r$ and argument
$a$, let $n_{(g,r,a)}(\omega)$ be the number of times
that the f-structure:
$$\left[\begin{array}{l} \mathrm{PRED} = g \\ r = [\,\mathrm{PRED} = a\,] \end{array}\right]$$
appears as a subgraph of the f-structure of
$\omega$, i.e., the number of times that $a$ fills the
grammatical role $r$ of $g$. We used the lexical
model described in the last section to estimate
$P(a|g, r)$, and defined our first auxiliary feature
as:
$$f_l(\omega) = \log P(g_0) + \sum_{(g,r,a)} n_{(g,r,a)}(\omega) \log P(a|g, r)$$
where $g_0$ is the predicate of the root feature
structure. The justification for this feature is
that if f-structures were in fact trees, $f_l(\omega)$
would be (the logarithm of) a probability
distribution over them. The auxiliary feature $f_l$
is defective in many ways. Because LFG f-structures
are DAGs with reentrancies rather
than trees we double count certain arguments,
so $f_l$ is certainly not the logarithm of a probability
distribution (which is why we stressed
that our approach does not require an auxiliary
distribution to be a distribution).
The number of governor-argument tuples
found in different parses of the same sentence
can vary markedly. Since the conditional probabilities
$P(a|g, r)$ are usually very small, we
found that $f_l(\omega)$ was strongly related to the
number of tuples found in $\omega$, so the parse with
the smaller number of tuples usually obtains the
higher $f_l$ score. We tried to address this by
adding two additional features. We set $f_c(\omega)$ to
be the number of tuples in $\omega$, i.e.:
$$f_c(\omega) = \sum_{(g,r,a)} n_{(g,r,a)}(\omega).$$
Then we set $f_n(\omega) = f_l(\omega)/f_c(\omega)$, i.e., $f_n(\omega)$ is
the average log probability of a lexical dependency
tuple under the auxiliary lexical distribution.
We performed our experiments with $f_l$ as
the sole auxiliary distribution, and with $f_l$, $f_c$
and $f_n$ as three auxiliary distributions.
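The three auxiliary features can be computed directly from the tuple counts of a parse. The Python sketch below assumes the counts and the two probability tables are available as dictionaries. The function name, the toy probabilities and the convention of returning 0 for the normalized feature when a parse contains no tuples are all assumptions made for illustration.

```python
import math


def auxiliary_features(root_pred, tuple_counts, p_root, p_arg):
    """Return (f_l, f_c, f_n) for one parse.

    tuple_counts maps (g, r, a) to the number of times the tuple occurs
    in the f-structure of the parse; p_root maps g to P(g); p_arg maps
    (g, r, a) to P(a | g, r). Both tables are hypothetical lookups.
    """
    f_l = math.log(p_root[root_pred])
    f_l += sum(n * math.log(p_arg[t]) for t, n in tuple_counts.items())
    f_c = sum(tuple_counts.values())
    # Assumed convention: the average is 0 when the parse has no tuples.
    f_n = f_l / f_c if f_c else 0.0
    return f_l, f_c, f_n


# Two hypothetical parses of one sentence that differ in their number of tuples.
P_ROOT = {"say": 0.1}
P_ARG = {("say", "SUBJ", "man"): 0.01, ("say", "OBJ", "word"): 0.01}
PARSE_A = {("say", "SUBJ", "man"): 1}
PARSE_B = {("say", "SUBJ", "man"): 1, ("say", "OBJ", "word"): 1}

FEATS_A = auxiliary_features("say", PARSE_A, P_ROOT, P_ARG)
FEATS_B = auxiliary_features("say", PARSE_B, P_ROOT, P_ARG)
```

In this toy case the parse with fewer tuples (`PARSE_A`) receives the higher first feature simply because it sums fewer negative log probabilities, which is the bias the count and average features are meant to offset.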
Because our corpora were so small, we trained 
and tested these models using a 10-fold cross- 
validation paradigm; the cumulative results are 
shown in Table 1. On each fold we evaluated 
each model in two ways. The correct parses 
measure simply counts the number of test sen- 
tences for which the estimated model assigns 
its maximum parse probability to the correct 
parse, with ties broken randomly. The pseudo- 
likelihood measure is the pseudo-likelihood of 
test set parses; i.e., the conditional probability 
of the test parses given their yields. We actu- 
ally report the negative log of this measure, so a 
smaller score corresponds to better performance 
here. The correct parses measure is most closely 
related to parser performance, but the pseudo- 
likelihood measure is more closely related to the 
quantity we are optimizing and may be more 
relevant to applications where the parser has to 
return a certainty factor associated with each 
parse. 
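Both evaluation measures are simple to state in code. The sketch below takes, for each test sentence, the conditional probabilities a model assigns to its parses and the index of the correct parse. Scoring a tie as the expected credit under random tie-breaking (1 divided by the number of tied parses) is an assumption. The fractional entries in Table 1 would be consistent with it, but that reading is not stated in the text.

```python
import math


def evaluate(test_set):
    """Return (correct parses score, negative log pseudo-likelihood).

    test_set is a list of (probs, gold) pairs, where probs holds the
    conditional probabilities of the parses of one sentence and gold is
    the index of the correct parse.
    """
    correct = 0.0
    neg_log_pl = 0.0
    for probs, gold in test_set:
        best = max(probs)
        tied = [i for i, p in enumerate(probs) if p == best]
        if gold in tied:
            # Assumed: expected credit when ties are broken uniformly at random.
            correct += 1.0 / len(tied)
        neg_log_pl -= math.log(probs[gold])
    return correct, neg_log_pl


# Hypothetical test set of three sentences; parse 0 is correct in each.
TEST_SET = [
    ([0.5, 0.5], 0),        # tie between two parses
    ([0.7, 0.2, 0.1], 0),   # correct parse is the unique maximum
    ([0.2, 0.8], 0),        # model prefers the wrong parse
]
CORRECT, NEG_LOG_PL = evaluate(TEST_SET)
```

The example shows why the two measures can disagree: the third sentence contributes nothing to the correct parses score regardless of how much probability the correct parse receives, whereas the pseudo-likelihood measure rewards every increase in that probability.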
Table 1 also provides the number of indistinguishable
sentences under each model. A sentence
$y$ is indistinguishable with respect to features
$f$ iff $f(\omega_c) = f(\omega')$, where $\omega_c$ is the correct
parse of $y$ and $\omega_c \neq \omega' \in \Omega(y)$, i.e., the feature
values of the correct parse of $y$ are identical to the
feature values of some other parse of $y$. If a
sentence is indistinguishable it is not possible
to assign its correct parse a (conditional) probability
higher than the (conditional) probability
assigned to other parses, so all else being equal
we would expect an SUBG with fewer indistinguishable
sentences to perform better than
one with more.
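The definition translates directly into a check on feature vectors. In the Python sketch below, the corpus layout, the function names and the auxiliary feature values are hypothetical. It illustrates how appending a real-valued auxiliary feature can separate parses that the original features cannot.

```python
def is_indistinguishable(parses, correct):
    """True iff some other parse has exactly the feature vector of the correct parse."""
    target = tuple(parses[correct])
    return any(tuple(f) == target for i, f in enumerate(parses) if i != correct)


def count_indistinguishable(corpus):
    return sum(1 for parses, correct in corpus if is_indistinguishable(parses, correct))


# Hypothetical corpus: (feature vectors of the parses, index of the correct parse).
BASE = [
    ([[1, 0], [1, 0], [0, 2]], 0),   # the correct parse shares its vector with parse 1
    ([[2, 1], [1, 1]], 0),
]
# One hypothetical auxiliary feature value per parse, appended to each vector.
AUX = [[-3.2, -5.1, -4.0], [-2.0, -2.5]]
EXTENDED = [
    ([f + [a] for f, a in zip(parses, aux)], correct)
    for (parses, correct), aux in zip(BASE, AUX)
]
```

Here `count_indistinguishable(BASE)` is 1 and `count_indistinguishable(EXTENDED)` is 0. Adding features can never increase the count, since vectors that differ on the original features still differ after the extension.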
Adding auxiliary features reduced the already 
low number of indistinguishable sentences in the 
Verbmobil corpus by only 11%, while it reduced 
the number of indistinguishable sentences in the 
Homecentre corpus by 24%. This probably re- 
flects the fact that the feature set was designed 
by inspecting only the Verbmobil corpus. 
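These percentages follow from the counts of indistinguishable sentences in Table 1 (9 and 8 for Verbmobil, 45 and 34 for Homecentre). A short Python check, with a hypothetical helper name, is:

```python
def relative_reduction(before, after):
    """Fraction by which a count decreases."""
    return (before - after) / before


VERBMOBIL = relative_reduction(9, 8)      # 1/9, about 11%
HOMECENTRE = relative_reduction(45, 34)   # 11/45, about 24%
```

In absolute terms the Verbmobil reduction is a single sentence, so the 11% figure should be read with that in mind.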
We must admit disappointment with these 
results. Adding auxiliary lexical features im- 
proves the correct parses measure only slightly, 
and degrades rather than improves performance 
on the pseudo-likelihood measure. Perhaps this 
is due to the fact that adding auxiliary features 
increases the dimensionality of the feature vec- 
tor f, so the pseudo-likelihood scores with dif- 
ferent numbers of features are not strictly com- 
parable. 
The small improvement in the correct parses 
measure is typical of the improvement we might 
expect to achieve by adding a "good" non- 
auxiliary feature, but given the importance usu- 
ally placed on lexical dependencies in statistical 
models one might have expected more improve- 
ment. Probably the poor performance is due 
in part to the fairly large differences between 
the parses from which the lexical dependencies 
were estimated and the parses produced by the 
LFG. LFG parses are very detailed, and many 
ambiguities depend on the precise grammatical 
relationship holding between a predicate and its
argument. It could also be that better performance
could be achieved if the lexical dependencies
were estimated from a corpus more closely
related to the actual test corpus. For example,
the verb feed in the Homecentre corpus is used in
the sense of "insert (paper into printer)", which
hardly seems to be a prototypical usage.

Verbmobil corpus (324 sentences, 172 non-auxiliary features)
Auxiliary features used    Indistinguishable    Correct    - log PL
(none)                     9                    180        401.3
$f_l$                      8                    183        401.6
$f_l$, $f_c$, $f_n$        8                    180.5      404.0

Homecentre corpus (481 sentences, 186 non-auxiliary features)
Auxiliary features used    Indistinguishable    Correct    - log PL
(none)                     45                   283.25     580.6
$f_l$                      34                   284        580.6
$f_l$, $f_c$, $f_n$        34                   285        582.2

Table 1: The effect of adding auxiliary lexical dependency features to an SLFG. The auxiliary
features are described in the text. The column labelled "indistinguishable" gives the number of
indistinguishable sentences with respect to each feature set, while "correct" and "- log PL" give
the correct parses and pseudo-likelihood measures respectively.

Note that overall system performance is quite 
good; taking the unambiguous sentences into 
account the combined LFG parser and statisti- 
cal model finds the correct parse for 73% of the 
Verbmobil test sentences and 80% of the Home- 
centre test sentences. On just the ambiguous 
sentences, our system selects the correct parse 
for 56% of the Verbmobil test sentences and 59% 
of the Homecentre test sentences. 
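The relation between the two sets of percentages can be reproduced from the corpus sizes given earlier and the counts in Table 1. The Python sketch below rests on two assumptions that the text does not state explicitly: that every unambiguous sentence counts as correctly parsed, and that the figures refer to the "(none)" row of Table 1 (the other rows give values within about one percentage point).

```python
def accuracy(total, ambiguous, correct_on_ambiguous):
    """Return (accuracy on ambiguous sentences, accuracy on all sentences).

    Assumes every unambiguous sentence counts as correctly parsed.
    """
    unambiguous = total - ambiguous
    on_ambiguous = correct_on_ambiguous / ambiguous
    overall = (unambiguous + correct_on_ambiguous) / total
    return on_ambiguous, overall


# Corpus sizes from section 5; correct counts from the "(none)" row of Table 1.
VERBMOBIL = accuracy(540, 324, 180)
HOMECENTRE = accuracy(980, 481, 283.25)
```

Rounded to whole percentages, these give 56% and 73% for Verbmobil and 59% and 80% for Homecentre, in agreement with the figures quoted above.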
6 Conclusion 
This paper has presented a method for incorporating
auxiliary distributional information gathered
by other means, possibly from other corpora,
into a Stochastic "Unification-based" Grammar
(SUBG). This permits one to incorporate de- 
pendencies into a SUBG which probably can- 
not be estimated directly from the small UBG 
parsed corpora available today. It has the virtue 
that it can incorporate several auxiliary dis- 
tributions simultaneously, and because it asso- 
ciates each auxiliary distribution with its own 
"weight" parameter, it can scale the contribu- 
tions of each auxiliary distribution toward the 
final estimated distribution, or even ignore it 
entirely. We have applied this to incorporate 
lexical selectional preference information into 
a Stochastic Lexical-Functional Grammar, but 
the technique generalizes to stochastic versions 
of HPSGs, categorial grammars and transformational
grammars. An obvious extension of
this work, which we hope will be pursued in the
future, is to apply these techniques in
broad-coverage feature-based TAG parsers.

References 

Steven P. Abney. 1997. Stochastic Attribute-Value Grammars. Computational Linguistics, 23(4):597-617. 

Glenn Carroll and Mats Rooth. 1998. Valence 
induction with a head-lexicalized PCFG. In 
Proceedings of EMNLP-3, Granada. 

Frederick Jelinek. 1997. Statistical Methods for 
Speech Recognition. The MIT Press, Cambridge, Massachusetts. 

Mark Johnson, Stuart Geman, Stephen Canon, 
Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In The Proceedings of the 37th Annual 
Conference of the Association for Computational Linguistics, pages 535-541, San Francisco. Morgan Kaufmann. 

Ronald M. Kaplan and Joan Bresnan. 1982. 
Lexical-Functional Grammar: A formal system for grammatical representation. In Joan 
Bresnan, editor, The Mental Representation 
of Grammatical Relations, chapter 4, pages 
173-281. The MIT Press. 

Ronald M. Kaplan. 1995. The formal architecture of LFG. In Mary Dalrymple, Ronald M. 
Kaplan, John T. Maxwell III, and Annie 
Zaenen, editors, Formal Issues in Lexical-Functional Grammar, number 47 in CSLI 
Lecture Notes Series, chapter 1, pages 7-28. 
CSLI Publications. 

John T. Maxwell III and Ronald M. Kaplan. 
1995. A method for disjunctive constraint 
satisfaction. In Mary Dalrymple, Ronald M. 
Kaplan, John T. Maxwell III, and Annie 
Zaenen, editors, Formal Issues in Lexical- 
Functional Grammar, number 47 in CSLI 
Lecture Notes Series, chapter 14, pages 381- 
481. CSLI Publications. 

Mats Rooth, Stefan Riezler, Detlef Prescher, 
Glenn Carroll, and Franz Beil. 1999. Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th
Annual Meeting of the Association for Computational Linguistics, San Francisco. Morgan Kaufmann. 
