Estimation of Stochastic Attribute-Value Grammars using an 
Informative Sample 
Miles Osborne 
osborne~let.rug.nl 
Rijksuniversiteit Groningen, The Netherlands* 
Abstract 
We argue that some of the computational complexity 
associated with estimation of stochastic attribute- 
value grammars can be reduced by training upon an 
informative subset of the full training set. Results 
using the parsed Wall Street Journal corpus show 
that in some circumstances, it is possible to obtain 
better estimation results using an informative sam- 
ple than when training upon all the available ma- 
terial. Further experimentation demonstrates that 
with unlexicalised models, a Gaussian prior can re- 
duce overfitting. However, when models are lexi- 
ealised and contain overlapping features, overfitting 
does not seem to be a problem, and a Gmlssian prior 
makes minimal difference to performance. Our ap- 
proach is applicable for situal;ions when there are 
an infeasibly large mnnber of parses in the training 
set, or else for when recovery of these parses fl'om 
a packed representation is itself comi)utationally ex- 
pensive. 
1 Introduction 
Abney showed that attribute-value grammars can- 
not be modelled adequately using statistical tech- 
niques which assume that statistical dependencies 
are accidental (Ablmy, 1997). Instead of using a 
model class that assumed independence, Abney sug- 
gested using Random Fields Models (RFMs) tbr 
attribute-value grmnmars. RFMs deal with the 
graphical structure of a parse. Because they do not 
make independence assumptions about the stochas- 
tic generation process that might have produced 
some parse, they are able to model correctly depen- 
dencies that exist within parses. 
When estimating standardly-formulated RFMs, it 
is necessary to sum over all parses licensed by the 
grammar. For many broad coverage natural lan- 
guage grammars, this might involve summing over 
an exponential number of parses. This would make 
the task eomtmtationally intractable. Almey, fol- 
lowing the lead of Lafferty et al, suggested a Monte 
* Current address: osborne@eogsei.ed.ae.uk, University of 
Edinburgh, Division of Informaties, 2 Bueeleuch Place, EII8 
9LW, Scotland. 
Carlo simulation as a way of reducing the computa- 
tional burden associated with RFM estimation (Laf- 
ferty et al., 1997). However, Johnson ct al consid- 
ered the form of sampling used in this sinmlation 
(Metropolis-Hastings) intractable (Johnson et M., 
1999). Instead, they proposed an Mternative strat- 
egy that redefined the estimation task. It was argued 
that this redefinition made estimation eomtmtation- 
Mly simple enough that a Monte Carlo simulation 
was unnecessary. They presented results obtained 
using a small unlexicalised model trained on a mod- 
est corlms. 
Unfortunately, Johnson et al assumed it was possi- 
ble to retrieve all parses licensed by a grmnmar when 
parsing a given training set. For us, this was not 
the case. In our experiments with a manually writ- 
ten broad coverage Definite Clause Grammar (DCG) 
(Briscoe and Carroll, 1996), we were only able to re- 
cover M1 parses for Wall Street .Journal sentences 
that were at most 13 tokens long within acceptable 
time and space bounds on comtmtation. When we 
used an incremental Minilnum Description Length 
(MDL) based learner to extend the coverage of our 
mmmally written gralnular (froul roughly 6()~ to 
around 90% of the parsed Wall Street .Jouriml), the 
situation became worse. Sentence ambiguity consid- 
erably increased. We were then only able to recover 
all parses for Wall Street Journal sentences that were 
at most 6 tokens long (Osborne, 1999). 
We can however, and usually in polynomial time, 
recover up to 30 parses for sentences up to 30 tokens 
long when we use a probabilistic unpacking mecha- 
nism (Carroll and Briscoe, 1992). (Longer sentences 
than 30 tokens can be parsed, but the nmnber of 
parses we can recover for them drops off rapidly). 1 
However, 30 is far less tlmn the maximum number 
l"vVe made an attempt to determine the maximum num- 
ber of parses our grammar might assign to sentences. On 
a 450MIIz Ultra Spare 80 with 2 G'b of real memory, with 
a limit of at most 1000 parses per sentence, and allowing 
no more than 100 CPU seconds per sentence, we found that 
sentence ambiguity increased exponentially with respect to 
sentence length. Sentences with 30 tokens had an estimated 
average of 866 parses (standard deviation 290.4). Without 
the limit of 1000 parses per sentence, it seems likely that this 
average would incrc, ase. 
586 
of parses per sentence o111' grammar mighl, assign to 
Wall Stl"ec't Journal sent;enees. Any training set we 
have a(:eess to will therefore be l|eeessarily limite(l 
in size. 
We therefore need an estimation strategy that 
takes seriously the issue of extracting the 1)esl, per- 
refinance fl'om a limited size training Met. A limited 
size tra.ining sol; means one ereate(l l)y retrieving at 
most n t)arses per Ment(mee. Although we (:annot re- 
cover all t)ossil)le i)arses~ we (lo }lave a choice as to 
which llarses estimation should 1)e based Ul)On. 
Our ai)proach to the prol)lem of making I{FM es- 
timation feasible, ibr our highly amt)iguous I)CG is 
to seek ol\]|; an ivformativc samt)le and train ui)on 
that. We (lo not redefine the estimation task in a 
non-sl;al~tlard w;~y, 1101' (lo we llSe a ~{o\]lte Carlo siln- 
ulation. 
We (:all a salul)lc informative if it 1)oth leads to 
the select;ion of a 111ollol that does not mldertit or 
overfit, and also is typical of t'utm'e samples, l)esl)itc 
()lie's intuitions, an infornmtive saml)le might be a 
prol)er subset of the fifll training set. This means 
that estinlation using the int'ornmtiv(; sample might 
yield 1)etter results than estimation using all of the 
l;rainhlg Met;. 
The ):(;st of this 1)aper is as tbllows, l,'irstly we 
introduce RFMs. Then we show how they nlay be 
esl;imated and how an infbrmative saml)le might 1)e 
identified. Nexl;, we give details of the, a(;tribute- 
vahle gramnlar we use, all(t show \]lOW we ~o at)otlt 
modelling it. We then i)resent two sets of experi- 
mel)ts. The first set is small scale, and art! de.signed 
to show the existent;e of ;m inti)rmative sample. The 
second ski of CXl)erilll(;llI, S al'e larger in scale, an(1 
build upon the COml)utational savil|gS we are al)le 
to achieve using a probabilistic Unl)acking strategy. 
They show how large me(Ms (two orders of magni- 
tude larger than those reported by Johnson ctal) 
can 1)e estimated using the l)arsed Wall Street .lour- 
hal eort)us. Overlitting is shown to take place. They 
also show how this overfitting can be (partially) re- 
duced by using a Gaussian prior. Finally, we end 
with SOllle COllllllelltS Oil Ollr WOl.k. 
2 Random Field Models 
Here we show how attribute-wflue grammars may be 
modelle(1 using RFMs. Although our commentary is 
in terms of RFMs and grammars, it should t)e ol)- 
vious that RFM technology can be applied to other 
estimation see.narios. 
Let G be an attribute-value grammar, D the set 
of sentences within the string-set defined lly L(G) 
and ~ the union of the set of parses assigne(1 to 
each sentence in D by the gramnmr G. A Random 
Field Model, M, consist of two coral)orients: a set of 
features, F and a set; of 'wei.qhts, A. 
l?eatures are the basle building blocks of RFMs. 
They enable the system designer to spccit)~ {;lie key 
asl)ects of what it; takes to ditferentiate one 1)arse 
from a11other parse. Each feature is a t'lmetion from 
a 1)arse to an integer. Her(.', the integer value as- 
sociated with a feature is interpreted as the nmn- 
ber of times a feature 'matches' (is 'active') with 
a parse. Note features should not be confllsed with 
features as found in feature-value t)undles (these will 
be called atl;ril)utes instead). \]Peatures are usually 
manually selected by the sysl;ein designer. 
The other component of a RFM, A, is a set of 
weights, hffornmlly, weights tell its how ti;atures are 
to be used when nlodellillg parses. For exanlple, an 
active feature with a large weight might indicate that 
some parse had a higtl prolmlfility. Each weight Ai is 
associated with a thatm'e fi. Weights arc' real-valued 
nmnl)ers an(l ~:H'O autonmtically deternfined 113: an es- 
timation process (for example using hnproved Itera- 
tire Scaling (LaflL'rty et al., 1997)). One of the nice 
l)roI)erties of Rl.i'Ms is that 1111o likelihood fiuw£ion 
of a RFM is strictly concave. This means 1;hat there 
~/t"e. 11o h)cal lllillillla~ and so wc can be, l)e sure that 
sealing will result in estinmtion of a 11.1,'54 that is 
glol)ally ot)timal. 
The (unnormalised) total weight of a i)arse :c, 
'(J(:r), is a flulction of the. k feaLures that are 'active' 
on a 1)arse: 
k 
.,/,(.;) = (.)) (1) 
i=l 
The prol)ability of a parse, P(x I M), is simply 
the result of norm~dising the total weight associated 
with that parse: 
J'(:,, IM) = 12) 
z -- (a) 
yGf~ 
The inl;erpretation of this I)robability depends upon 
the apt)lication of tile RFM. Here, we use parse prol)- 
abilities to rettect preferences for parses. 
When using RFMs for parse selection, we sin> 
ply select the parse that ma.ximises ~/;(:1:). In these 
circumstances, there is 11o need to nornlalise (com- 
pute Z). Also, when comtmting ,/~(:c) for comi)eting 
parses, there is no built-in bias towards shorter (or 
longer) derivations, and so no need to normalise with 
respect to deriw~tion length/ 
2The reason there is no need to normalisc with respect to 
derivation length is that features can have positive o1" nega- 
tive weights. The weight of a parse will ttlcrcforc not always 
monotonically increase with respect to the re,tuber of active 
ti~atm'cs. 
587 
3 RFM Estimation and Selection of 
the Informative Sample 
We now sketch how RFMs may be estimated and 
then outline how we seek out an informa.tive smnple. 
We use hnproved Iterative Scaling (IIS) to esti- 
mate RFMs. In outline, the IIS algorithm is as fol- 
lows: 
1. Start with a reference distribution H,, a set of 
features F and a set of weights A. Let M be 
the RFM defined using F and A. 
2. Initialise all weights to zero. This makes tile 
initial model uniform. 
3. Compute the expectation of each feature w.r.t 
R. 
4. For each feature fi 
(a) Find a weight ~; that equates the expecta- 
tion of fi w.r.t/?, and the expectation of fi 
w.r.t M. 
(b) Ileplace the old value of ki with 21. 
5. If the model has converged to/£, output M. 
6. Otherwise, go to step 4 
Tile key step here is 4a, computing the expectations 
• of features w.r.t the RFM. This involves calculating 
the probability of a parse, which, as we saw fronl 
equation 2, requires a summation over all parses in 
ft. 
We seek out an informative sample ~l (fh C ~) 
as follows: 
I. Pick out from ~ a sample of size n. 
2. Estimate a model using that smnple and evalu- 
ate it. 
3. If the model just estimated shows signs of over- 
fitting (with respect to an unseen held-out data 
set), halt and output the inodel. 
4. Otherwise, increase n and go back to step 1. 
Our approach is motivated by tile following (par- 
tially related) observations: 
• Because we use a non-Imrmnetric model class 
and select an instance of it in terlns of some 
sample (section 5 gives details), a stochastic 
complexity argument tells us that an overly sim- 
ple model (resulting from a small sample) is 
likely to underfit. Likewise, an overly complex 
model (resulting from a large sample) is likely 
to overfit. An informative samI)le will therefore 
relate to a model that does not under or overfit. 
• On average, an informative sample will be %yp- 
ical' of future samples. For many reaMife situ- 
ations, this set is likely to be small relative to 
the size of the full training set. 
We incorporate the first observation through our 
search mechanism. Because we start with small sam- 
pies and gradually increase their size, we remain 
within the donmin of etliciently recoverable samples. 
The second observation is (largely) incorporated 
in the way we pick samples. The experimental sec- 
tion of this paper goes into the relevant details. 
Note our approach is heuristic: we cmmot afford 
to evahmte all 21~1 possible training sets. The actual 
size of the informative sample fit will depend both 
tile Ill)On the model class used and the maximum 
sentence length we can de~,l with. We would ex- 
pect: richer, lexicalised models to exhibit overfitting 
with slnaller samples than would be the case with 
unlexicalised models. We would expect the size of 
an informative sample to increase as the maxilnum 
sentence length increased. 
There are similarities between our approach and 
with estimation using MDL (Rissanen, 1989). How- 
ever, our implementation does not explicitly attempt 
to minimise code lengths. Also, there are similari- 
ties with importance sampling approaches to RFM 
estimation (such as (Chen and ll,osenfeld, 1999a)). 
However, such attempts do not miifinfise under or 
overfitting. 
4 The Grammar 
The grammar we model with I/andom Fields, (called 
the Ta 9 Sequence Grammar (Briseoe and Carroll, 
1996), or TSG for short) was developed with regard 
to coverage, and when compiled consists of 455 Def- 
inite Clause Grammar (DCG) rules. It does not 
parse sequences of words directly, but instead as- 
signs derivations to sequences of part-of-speech tags 
(using the CLAWS2 tagset. The grammar is rela- 
tively shallow, (for exmnple, it does not fltlly anal- 
yse unbounded dependencies) but it does make an 
attelnpt to deal with coilunou constructions, such as 
dates or names, commonly found in corpora, but of 
little, theoretical interest. Furthermore, it integrates 
into the syntax a text gramma.r, grouping utterances 
into units that reduce the overall ambiguity. 
5 Modelling the Grammar 
Modelling the TSG with respect to the parsed Wall 
Street .\]ournal consists of two steps: creation of a 
feature set and definition of the reference distribu- 
tion. 
Our feature set is created by parsing sentences in 
the training set (~br), and using earl, parse to ill- 
stantiate templates. Each template defines a family 
of features. At present, the templates we use are 
somewhat ad-hoc. However, they are motivated by 
the observations that linguistically-stipulated units 
(DCG rules) are informative, trod that ninny DCG 
apl)lications in preferred parses can be predicted us- 
ing lexical information. 
588 
AP/al:unimlie{h.'d I 
A 1/aI) p i :unimpeded 
unimI)eded PP/t)I:by \[ 
P1/iml :by 
by N1/n:trafli{: 
I 
trafl\]{: 
Figure 1: TSG Parse Fragumnt 
The first template creates features that count 
{;lie numl)er of tinms a. DUG instantiationis i)resent 
within ,2 1}arse. a For examt}le , Sul)p{}s{~ we 1)arse{t 
the Wall Street Journal AP: 
1 unimpeded 1}y t:ra{lic 
A parse tree generated by TSG nfight lie as shown 
in figure 1. Here, to s:~ve on Sl}ace, wc have labdled 
each interior node in the parse tree with TSG rule 
names, and not attribut(;-valu(~ bun(lies. Further- 
more, we have mmota.t('xl each node with the lmad 
w(}rd of tim l}hrase in question. Within ollr gl;aill- 
mar, heads arc (usually) ext)lMtly marke{t. This 
1110;/,118 W(~ do l\]ot \]l;~v{~ to Ill&k(~ ;lily gllossos w}lcll 
identit\[ying the head of a. local tree. With head in- 
foi'mtd;io\]b we are alo/e to lexicalise models. \Ve haa;e 
suppressed taggillg information. 
For {'.xamp\]e, a \]hature (h'Jin(;d using this t(;nlplat{; 
might (:O1111t tho, nuinber ()f times th(! we saw: 
AP/at I 
A1/at/1)1 
111 a 1)arse. Such features r(~coi'd sore( 2 (if the context 
of the rule a.t}p\]i(:ation, in that rule al}t}Iication8 that 
differ ii1 terms of how attributes are bound will 1}e 
modelled by (litlhrent thatures. 
Our se{'ond total}late creates features that al'{'~ par- 
tially lexicalised. I~br each lo{:al tree (of depth one) 
that has a \]?P daughter, we create a feature that 
counts the lmmber of times that h)cal tree, de(:orated 
with the head-woM of the I'l', was seen in a. parse. 
An cxmnple of such ;1 lexicMised feature would 1}e: 
A1/apt}l 
I 
PI)/til:l)y 
3Note, all (}111" fo.al;/ll'es Slll)i)r(?ss ;tlly t{!l'nlillals thgtl, al)i}em' 
in a h}caI 1,Fe(!. Lexical informaI;ioll is in{:luded when we decide 
to lexicalise features. 
These featm'cs are designed to model PP attach- 
ments that can be resolved using the head of the 
PP. 
The thh'd mid tinM template creates featuros that 
are again partiMly lexicalised. This time, we create 
local trees of det}th one that are, decorated with the 
head word. For example, here is one such feature: 
AP/al:mfimpeded I 
A1/appl 
Note the second and third templates result in fea- 
tures that overlap with features resulting fl'om at)- 
i}\]icati(ms of the first template. 
We create the reference distribution 1~ (an associ- 
ation of t)r{}l}al)i\]ities with TSG parses of sentences, 
such that the t}robabilities reflect 1}a.rse i)references) 
using the following process: 
1. Extra{;t some sami}le f~T (using the al)l)roach 
mentioned in sc(:tion 3). 
2. For each sentence in tim sample, for each l)arse 
of that sent;encc', {:Olnl)ute the '(lista.ncc' be- 
tween the TSG 1}mse and the WSJ refereuce 
parse. \]1\] our at)t)roach, dista.nce is cM{:lfla.tc(1 
in tcl7111s of a weighted Slltll of crossing rates, re- 
call and 1}recision. Mininlising it maximises our 
definition of parse plausibility. 4 However, there 
is nothing inherently crucial about this decision. 
Auy othc'r objective flmction (thaJ; can l)c ret)- 
r(~sent('.(l as an CXl}Oncntial distribution) couh\] 
1)e used instead. 
3. Normalise the distan('es, such that for some ,sen- 
tcn(:e, tim sum of tim distances of all rt~cov- 
O,l.'od ~\[?SG t)al"S(~S \]\['(/1" that soii|;(!ilCO, is a COllSt£tilt 
a.cross all sento.nces. Nornmlising in this man- 
ner ensures that each sentence is cquil)robal}le 
0"emcmber that \]{FM probabilities are in terms 
of I}a.rse lir{'.fl~r{'.nces, and not probability of oc- 
{:llrr{HIee ill 8{}111{~ (;orl)llS). 
4. Map the norinalised distances into 1}robabili- 
ties. If d(p) is the normalised {listance of TSG 
l/;}~l"Se p, then associate with parse 1) the refer- 
(race probability given by the maximum likeli- 
hood estimator: 
rl(1,) (4) 
Our approach therefore gives t}artial cl'e(lit (a 11oil- 
zero reference l)robability) to a.ll parses in ~z. /2, is 
thcreibr(; not as discontimlous as the equivalent dis- 
trit)ution used by Johnson at al. We therefl)re do not 
need to use simulated annea.ling o1' other numerically 
intensive techniques to cstiinate models. 
4Ore' distanc(~ mo.l;ric is the same one used I}y llekto{m 
(ltektoen, 19.97) 
589 
6 Experiments 
Here we present two sets of experiments. The first 
set demonstrate the existence of an informative sam- 
ple. It also shows some of the characteristics of three 
smnpling strategies. The second set of experiments 
is larger in scale, and show RFMs (both lexicalised 
and unlexicalised) estimated using sentences up to 
30 tokens long. Also, the effects of a Gaussian prior 
are demonstrated as a way of (partially) dealing with 
overfitting. 
6.1 Testing the Various Sampling 
Strategies 
In order to see how various sizes of sample related to 
estimation accuracy and whether we could achieve 
similar levels of performm~ce without recovering all 
possible parses, we ran the following experiments. 
We used a model consisting of features that were 
defined using all three templates. We also threw 
away all features that occurred less than two times in 
the training set. We randomly split; the Wall Street 
Journal into disjoint training, held-out and testing 
sets. All sentences in the training and held-out sets 
were at most 14 tokens long. Sentences ill the test- 
tug set, were at most 30 tokens long. There were 
6626 sentences in the training set, 98 sentences in 
the held-out set and 441 sentences in tile testing set. 
Sentences in the held-out set had on average 12.6 
parses, whilst sentences in the testing-set had on av- 
erage 60.6 parses per sentence. 
The held-out set was used to decide which model 
performed best. Actual performmme of the models 
should be judged with rest)ect to the testing set. 
Evaluation was in terIns of exact match: tbr each 
sentence in the test set, we awarded ourselves a 
t)oint if the RFM ranked highest the same parse that 
was ranked highest using the reference probabilities. 
When evahmting with respect to the held-out set, 
we recovered all parses for sentences in the held-out 
set. When evaluating with respect to the testing-set, 
we recovered at most 100 parses per sentence. 
For each run, we ran IIS for the same number 
of iterations (20). In each case, we evaluated the 
RFM after each other iteration and recorded the best 
classification pertbrmance. This step was designed 
to avoid overfitting distorting our results. 
Figure 2 shows the results we obtained with pos- 
sible ways of picking 'typical' samples. The first 
column shows the maxinmm number of parses per 
sentences that we retrieved in each sample. 
The second column shows the size of the sample 
(in parses). 
The other cohmms give classification accuracy re- 
sults (a percentage) with respect to the testing set. 
In parentheses, we give performance with respect; to 
the held-out set. 
The column marked Rand shows the performance 
Max parses Size 
1 6626 
2 12331 
3 17026 
5 24878 
10 39581 
100 119694 
1000 246686 
oo 267400 
Rand SCFG Ref 
25.2 (51.7) 23.3 (59.0) 23.4 (50.0) 
37.9 (63.0) 40.4 (60.3) 40.4 (60.0) 
43.2 (65.5) 43.7 (63.8) 43.7 (63.8) 
43.7 (70.2) 45.8 (69.5) 45.8 (69.5) 
47.4 (72.0) 47.0 (70.0) 46.9 (70.0) 
45.0 (68.7) 45.0 (68.0) 45.0 (68.0) 
44.4 (67.4) 43.0 (67.0) 43.0 (67.0) 
43.0 (66.0) 43.0 (66.0) 43.0 (66.0) 
Figure 2: Results with various sampling strategies 
of runs that used a sample that contained parses 
which were randomly and uniformly selected out of 
the set, of all possible parses. The classification ac- 
curacy results for this sampler are averaged over 10 
runs. 
The column marked SCFG shows the results ob- 
tained when using a salnple that contained 1)arses 
that were retrieved using the probabilistic unI)acking 
strategy. This did not involve retrieving all possible 
parses for each sentence in the training set,. Since 
there is no random component, the results arc fl'om a 
single run. Here, parses were ranked using a stochas- 
tic context free backbone approximation of TSG. Pa- 
rameters were estimated using simple counting. 
FinMly, the eohunn marked Ref shows the re- 
sults ol)tained when USillg a sample that contained 
the overall n-best parses per sentence, as defined in 
terms of the reference distril)ution. 
As a baseline, a nlodel containing randomly as- 
signed weights produced a classification accuracy of 
45% on the held-out sentences. These results were 
averaged over 10 runs. 
As can be seen, increasing the sainple size pro- 
duces better results (for ca& smnl)ling strategy). 
Around a smnple size of 40k parses, overfitting starts 
to manifest, and perIbrmance bottoms-out. One of 
these is therefore our inforinative sample. Note that 
the best smnple (40k parses) is less than 20% of the 
total possible training set. 
The ditference between the various samplers is 
marginal, with a slight preference for Rand. How- 
ever the fact that SUFG sampling seems to do ahnost 
as well as Rand sampling, and fllrthermore does not 
require unpacking all parses, makes it the sampling 
strategy of choice. 
SCFG sampling is biased in the sense that the 
sample produced using it will tend to concentrate 
around those parses that are all close to the best, 
parses. Rand smnpling is unbiased, and, apart 
h'om the practical problems of having to recover all 
parses, nfight in some circumstances be better than 
SCFG sampling. At the time of writing this paper, 
it was unclear whether we could combine SCFG with 
Rand sampling -sample parses from the flfll distribu- 
590 
lion without unpacking all parses. We suspect that 
for i)robabilistic unt)acking to be efficient, it nmst 
\]:ely upon some non-uniform distribution. Unpack- 
ing randomly and uniformly would probably result 
in a large loss in computational eliiciency. 
6.2 Larger Scale Evaluation 
Here we show results using a larger salnl)le and test- 
ing set. We also show the effects of lexicalisation, 
overtitting, and overfitting avoidance using a Gaus- 
sian prior. Strictly speaking this section could have 
been omitted fl'om the paper. However, if one views 
estimation using an informative sami)le as overfit- 
ling avoi(lance, then estimation using a Gaussian 
l)rior Call be seen as another, complementary take 
on the problem. 
The experimental setup was as follows. We rall- 
domly split the Wall St, reel: Journal corpus into a 
training set and a testing set. Both sets contained 
sentence.s t;hat were at most 30 tokens hmg. When 
creating the set of parses used to estimate Ii.FMs, we 
used the SCFG approach, and retained the top 25 
parses per sentence. Within the training set (arising 
Dora 16, 200 sentences), there were 405,020 parses. 
The testing set consisted of 466 sentences, with an 
average of 60.6 parses per sentence. 
When evahmtillg, we retrieved at lllOSt 100 lmrscs 
per sentence in the testing set and scored them using 
our reference distribution. As lmfore, we awarded 
ourselves a i)oinl; if the most probable testing parse 
(in terms of the I/.MF) coincided with the most t)rol)- 
able parse (in terms of the reference distribution). In 
all eases, we ran IIS tbr 100 iterations. 
For the tirst experiment, we used just the first 
telnp\]at('. (features that rc'la.t(;d to DC(I insl;antia- 
tions) to create model l; the second experiment uso.d 
the first and second teml)lat(~s (additional t'eatm'o.s 
relating to PP attachment) to create model 2. The 
linal experiment used all three templat('~s (additional 
fea,tllres that were head-lexicalised) to create model 
3. 
The three mo(lels contained 39,230, 65,568 and 
278, 127 featm:es respectively, 
As a baseline, a model containing randomly as- 
signed weights achieved a 22% classification accu- 
racy. These results were averaged over 10 runs. Fig- 
ure 3 shows the classification accuracy using models 
1, 2 and 3. 
As can 1)e seen, the larger scale exl)erimental 
results were better than those achieved using the 
smaller samples (mentioned in section 6.1). The rea- 
Sell for this was because we used longer sentc,11ces. 
The. informative sainple derivable Kern such a train- 
ing set was likely to be larger (more representative of 
54 
52 
5O 
o>, 
~, 48 
o < 
46 
44 
42 
I I i . I, . A.. I i ... I .. 
~'\.,/',--- "'. ..'" .... ..model1 ........ 
"ff ~ "~ ~L model2 ...... 
/....,' ~ model3 .... _ 
,, ,,,,- ......... . ~_\/..~ \ 
10 20 30 40 50 60 70 80 90 100 
Iterations 
Figure 3: Classification Accuracy tbr Three Models 
Estinmted using Basic IIS 
56 --~ r r \] l 1 l 1 7 
54 
52 
50 
o~ ~: 4a 
46 
44 
42 
model1 -- -- 
model2 ..... 
0 10 20 30 40 50 60 70 80 90 1 O0 
Iterations 
Figure .l: Classification Accuracy for .\[hre(. Models 
Estinmted using a Gmlssian Prior and IIS 
the population) than the informative sample deriv- 
al)led from a training set using shorter, less syntat'- 
tically (Xmll)lex senten(:es. With the unle.xicalised 
model, we see (:lear signs of overfitting. Model 2 
overfits even more so. For reasons that are unclear, 
we see that the larger model 3 does not ai)pem: to 
exhibit overtitting. 
We next used the Gaussian Prior method of 
Chen and Rosenfeld to reduce overfitting (Chen 
and Rosenfeld, 1999b). This involved integrating 
a Gaussian prior (with a zero mean) into Ills and 
searching for the model that maximised the, prod- 
uct of the likelihood and prior prolmbilities. For the 
experiments reported here, we used a single wlri- 
ante over the entire model (better results might be 
achievable if multiple variances were used, i)erhaps 
with one variance per telnl)late type). The aetllal 
value of the variance was t'cmnd by trial-and-error. 
Itowever, optimisation using a held-out set is easy 
to achieve,. 
591 
We repeated the large-scale experiment, but this 
time using a Gaussian prior. Figure 4 shows the 
classification accuracy of the models when using a 
Gmlssian Prior. 
When we used a Gaussian prior, we fotmd that all 
models showed signs of imt)rovenmnt (allbeit with 
varying degrees): performance either increased, or 
else did not decrease with respect to the munber 
of iterations, Still, model 2 continued to underper- 
form. Model 3 seemed most resistent to the prior. 
It theretbre appears that a Gaussian prior is most 
useful for unlexicalised models, and that for mod- 
els built from complex, overlapping features, other 
forms of smoothing must be used instead. 
7 Comments 
We argued that RFM estimation tbr broad-coverage 
attribute-valued grammars could be made eompu- 
tationally tractable by training upon an inforlna- 
tive sample. Our small-scale experiments suggested 
that using those parses that could be etliciently un- 
packed (SCFG sampling) was ahnost as effective as 
sampling from all possible parses (R~and salnplillg). 
Also, we saw that models should not be both built 
and also estimated using all possible parses. Better 
results can be obtained when models m'e built and 
trained using an intbrmative san@e. 
Given the relationshi I) between sample size and 
model complexity, we see that when there is a dan- 
ger of overfitting, one should build models on the ba- 
sis of all informative set. Itowever, this leaves open 
the possil)ility of training such a model upon a su- 
1)erset of the, informative set;. Although we ha.re not 
tested this scenario, we believe that this would lead 
to t)etter results ttlan those achieved here. 
The larger scale experiments showed that I{FMs 
can be estimated using relatively long sentences. 
They also showed that a simple Gaussian prior could 
reduce the etfects of overfitting. However, they also 
showed that excessive overfitting probably required 
an alternative smoothing approach. 
The smaller and larger experiments can be both 
viewed as (complementary) ways of dealing with 
overfitting. We conjecture that of the two ap- 
proaches, the informative smnple al)proach is prefer- 
able as it deals with overfitting directly: overfitting 
results fi'om fitting to complex a model with too lit- 
tle data. 
Our ongoing research will concentrate upon 
stronger ways of dealing with overfitting in lexi- 
calised RFMs. One line we are pursuing is to com- 
bine a compression-based prior with an exponential 
model. This blends MDL with Maximum Entropy. 
We are also looking at alternative template sets. 
For example, we would probably benefit fi'om using 
templates that capture more of the syntactic context 
of a rule instantiation. 
Acknowledgments 
We would like to tliank Rob Malouf, Domfla Nie 
Gearailt and tim anonymous reviewers for com- 
ments. This work was supported by tile TMR 
Project Lcar'nin9 Computational Grammars. 

References 

St, even P. Atmey. 1997. Stochastic Attribute- 
Value Grmmnm:s. Computational Linguistics, 
23(4):597- 618, December. 

Miles Osborne 19§9. DCG induction using MDL 
and Parsed Corpora. In James Cussens, editor, 
Lcarnin9 Langua9 c in Logic, pages 63-71, Bled, 
Slovenia, June. 

Ted Briscoe and John Carroll. 1.996. Autolnatic 
Extraction of Subcategorization from Corpora. 
In Proceedings of the 5 th Conference on Applied 
NLP, p~ges 356-363, Washington, DC. 

John Carroll and Ted Briscoe. 1992. Probabilis- 
tic Normalisation and Unpacking of Paclmd Parse 
Forests for Unification-lmse, d Grmnmars. Ill Pro- 
cccdi,n9 s of the AAAI Fall Symposi'u,m on P~vb- 
abilistic AppTvach, es to Natural Language , pages 
33-38, Cambridge, MA. 

Stanley Chen and Honald l{osenfeld. 1999a. Effi- 
cient Sampling and Feature Selection in Whole 
Sentence Maxinmin Entrol)y Language Models. In 
ICA SSP '99. 

Stanley F. Chen and Ronald Rosenfeld. 1999b. 
A Gaussian Prior for Smoothing Maxinmm \]211- 
tropy Models. Technical Rel)ort CMU-CS-99-108, 
Carnegie Mellon University. 

Eirik Hektoen. 1997. Probabilistic Parse Select;ion 
Based on Semantic Cooet:llrl'ellees. \]ill Pl'og('.cd- 
ings oJ" th, e 5th l'ntc, r'national Wo~wkh, op on Parsing 
Tcch, nolo.qics, Cambridge, Massach'usctts, 1)ages 
113 122. 

Marl< Johnson, Stuart Geman, Stephen Cannon, 
Zhiyi Chi, and Stephan Riezler. 1999. Esl, inmtors 
for Stochastic "Unification-based" (~rammars. In 
37 th Annual Meeting of the ACL, 

J. Latferty, S. Della Pietra, and V. Della Pietra. 
1997. Inducing Features of Random Fields. 1EEE 
Transactions on Pattern Analysis and Mach, inc 
Intclligcncc, 19(4):380 393, April. 

Jorma Rissanen. 1989. Stochastic Complezity in 
Statistical i'nquiry. Series in Computer Science - 
Volmne 15. World Scientific. 
