In: Proceedings of CoNLL-2000 and LLL-2000, pages 7-12, Lisbon, Portugal, 2000. 
Corpus-Based Grammar Specialization 
Nicola Cancedda and Christer Samuelsson 
Xerox Research Centre Europe 
6, chemin de Maupertuis 
38240, Meylan, France 
{Nicola. Cancedda, Christer. Samuelsson}@xrce. xerox, com 
Abstract 
Broad-coverage grammars tend to be highly am- 
biguous. When such grammars are used in a 
restricted domain, it may be desirable to spe- 
cialize them, in effect trading some coverage for 
a reduction in ambiguity. Grammar specializa- 
tion is here given a novel formulation as an opti- 
mization problem, in which the search is guided 
by a global measure combining coverage, ambi- 
guity and grammar size. The method, applica- 
ble to any unification grammar with a phrase- 
structure backbone, is shown to be effective in 
specializing a broad-coverage LFG for French. 
1 Introduction 
Expressive grammar formalisms allow grammar 
developers to capture complex linguistic gener- 
alizations concisely and elegantly, thus greatly 
facilitating grammar development and main- 
tenance. Broad-coverage grammars, however, 
tend to overgenerate considerably, thus allowing 
large amounts of spurious ambiguity. If the ben- 
efits resulting from more concise grammatical 
descriptions are to outweigh the costs of spuri- 
ous ambiguity, the latter must be brought down. 
We here investigate a corpus-based compilation 
technique that reduces overgeneration and spu- 
rious ambiguity without jeopardizing coverage 
or burdening the grammar developer. 
The current work extends previous work 
on corpus-based grammar specialization, which 
applies variants of explanation-based learning 
(EBL) to grammars of natural s. The 
earliest work (Rayner, 1988; Samuelsson and 
Rayner, 1991) builds a specialized grammar by 
chunking together grammar rule combinations 
while parsing training examples. What rules to 
combine is specified by hand-coded criteria. 
Subsequent work (Rayner and Carter, 1996; 
Samuelsson, 1994) views the problem as that 
of cutting up each tree in a treebank of cor- 
rect parse trees into subtrees, after which the 
rule combinations corresponding to the subtrees 
determine the rules of the specialized gram- 
mar. This approach reports experimental re- 
sults, using the SRI Core Language Engine, 
(Alshawi, 1992), in the ATIS domain, of more 
than a 3-fold speedup at a cost of 5% in gram- 
matical coverage, the latter which is compen- 
sated by an increase in parsing accuracy. Later 
work (Samuelsson, 1994; Sima'an, 1999) at- 
tempts to automatically determine appropriate 
tree-cutting criteria, the former using local mea- 
sures, the latter using global ones. 
The current work reverts to the view of EBL 
as chunking grammar rules. It extends the 
latter work by formulating grammar special- 
ization as a global optimization problem over 
the space of all possible specialized grammars 
with an objective function based on the cover- 
age, ambiguity and size of the resulting gram- 
mar. The method was evaluated on the LFG 
grammar for French developed within the PAR- 
GRAM project (Butt et al., 1999), but it is 
applicable to any unification grammar with a 
phrase-structure backbone where the reference 
treebank contains all possible analyses for each 
training example, along with an indication of 
which one is the correct one. 
To explore the space of possible grammars, a 
special treebank representation was developed, 
called a \]folded treebank, which allows the ob- 
jective function to be computed very efficiently 
for each candidate grammar. This representa- 
tion relies on the fact that all possible parses 
returned by the original grammar for each train- 
ing sentence axe available and the fact that the 
grammar specialization never introduces new 
parses; it only removes existing ones. 
The rest of this paper is organized as follows: 
Section 2 describes the initial candidate gram- 
mar and the operators used to generate new 
candidate grammars from any given one. The 
function to be maximized is introduced and mo- 
tivated in Section 3. The folded treebank repre- 
sentation is described in Section 4, while Sec- 
tion 5 presents the experimental results. 
2 Unfolding and Specialization 
The initial grammar is the grammar underly- 
ing the subset of correct parses in the training 
set. This is in itself a specialization of the gram- 
mar which was used to parse the treebank, since 
some rules may not show up in any correct parse 
in the training set; experimental results for this 
first-order specialization are reported in (Can- 
cedda and Samuelsson, 2000). This grammar 
is further specialized by inhibiting rule combi- 
nations that show up in incorrect parses much 
more often than in correct parses. 
In more detail, we considered downward un- 
folding of grammar rules (see Fig.l)3 A gram- 
mar rule is unfolded downwards on one of the 
symbols in its right-hand side if it is replaced 
by a set of rules, each corresponding to the ex- 
pansion of the chosen symbol by means of an- 
other grammar rule. More formally, let G = 
(E, EN, S, R) be a context-free grammar, and 
let r,r' C R, k E .M + such that rhs(r) = aAfl, 
lal = k - 1, lhs(r') = A, rhs(r') = V. The rule 
adjunction of r I in the k th position of r is defined 
as a new rule RA(r, k, r ~) = r', such that: 
lhs(r") = lhs(r) 
rhs(r") = aVfl 
For unification grammars, we instead require 
lhs(r') U rhs(r)(k) 
lhs(r 1') = O(lhs(r)) 
rhs(r") = O(oLTfl ) 
where rhs(r)(k) is the kth symbol of rhs(r), 
where X t3 Y indicates that X and Y unify, and 
where 0 is the most general unifier of lhs(r ~) and 
rhs(r)(k). 
The downward rule unfolding of rule r on its 
k th position is then defined as: 
DRU(r, k) = 
1The converse operation, upward unfolding, was not 
used in the current experiments. 
{r'\[3r"\[r' = RA(r, k, r")\]} if ¢ 0 
= {r} otherwise 
It is easy to see that if all r I E DRU(r, k) are 
retained then the new grammar has exactly the 
same coverage as the old one. Once the rule 
has been unfolded, however, the grammar can 
be specialized. This involves inhibiting some 
rule combinations by simply removing the cor- 
responding newly created rules. Any subset 
X C_ DRU(r,k) is called a downward special- 
ization of rule r on the k th element of its rhs. 
Given a grammar, all possible (downward) 
unfoldings of its rules are considered and, for 
each unfolding, the specialization leading to the 
best increase in the objective function is deter- 
mined. The set of all such best specializations 
defines the set of candidate successor grammars. 
In the experiments, a simple hill-climbing algo- 
rithm was adopted. Other iterative-refinement 
schemes, such as simulated annealing, could eas- 
ily be implemented. 
3 The Objective Function 
Previous research approached the task of de- 
termining which rule combinations to allow ei- 
ther by a process of manual trial and error or 
by statistical measures based on a collection of 
positive examples only: if the original grammar 
produces more than a single parse of a sentence, 
only the "correct" parse was stored in the tree- 
bank. However, we here also have access to all 
incorrect parses assigned by the original gram- 
mar. This in turn means that we do not need 
to estimate ambiguity through some correlated 
statistical indicator, since we can measure it di- 
rectly simply by checking which parse trees are 
licensed by every new candidate grammar G. 
There are many possible ways of combining the 
counts of correct and incorrect parses in a suit- 
able objective function. For the sake of sire- 
plicity we opted for a linear combination. How- 
ever, simply maximizing correct parses and min- 
imizing incorrect ones would most likely lead to 
overfitting. In fact, a grammar with one large 
flat rule for each correct parse in the treebank 
would achieve a very high score during training, 
but most likely perform poorly on unseen data. 
A way to avoid overfitting consists in penalizing 
large grammars by introducing an appropriate 
term in the linear combination. The objective 
8 
A C A-oc  A_oc 
B "-'~" D A "-'P" E F C A ---~" E F C 
B-----P- E F A~ 
B ~ G Specialization 
Downward unfolding 
of A -> B C on "B" 
A C j 
B -----~ A D B "'~"B C C "'D"E B C 
B '''~ A C ""-~E B C 
C '''~ E A Specialization 
Upward unfolding 
of A -> B C 
Figure 1: Schematic examples of upward and downward unfolding of rules. 
function Score was thus formulated as follows: 
Scorea = Acorr Corra -- Aine InCa - ~size Sizea 
where Corr and Inc are the number of correct 
and incorrect parses allowed by the grammar, 
and Size is the size of the grammar measured 
as the total number of symbol occurrences in the 
right-hand sides of its rules. Acorr and Ainc are 
weights controlling the pruning aggressiveness: 
their ratio Acorr/Ainc intuitively corresponds to 
the number of incorrect trees a specialization 
must disallow for each disallowed correct tree, 
if the specialization is to lead to an improve- 
ment over the current grammar. The lower this 
ratio is, the more aggressive the pruning is. The 
relative value of ;~size with respect to the other 
As also controls the depth to which the search is 
conducted: most specializations result in an in- 
crease in grammar size, which tends to be more 
and more significant as the number and the size 
of rules grows; a larger Asize thus has the effect 
of stopping the search earlier. Note that only 
two of the three weights are independent. 
4 Treebank Representation 
A folded treebank is a representation of a set 
of parse trees which allows an immediate as- 
sessment of the effects of inhibiting specific rule 
combinations. It is based on the idea of "fold- 
ing" each tree onto a representation of the gram- 
mar itself. Any phrase-structure grammar can 
be represented as a concatenation/or graph -- 
a directed bipartite multigraph with an or-node 
for each symbol and a concatenation-node for 
each rule in the grammar. The present de- 
scription covers context-free grammars, but the 
scheme can easily be extended to any unifica- 
tion grammar with a context-free backbone by 
replacing symbol eqality with unification. 
Given a grammar G = (E, EN, S,R), we can 
define a relation ~ and a (partial) function ~?n: 
• ~/~ C EN × R s.t. (A,r / E r/~ iffA = lhs(r) 
• r~R : R × Af + ~ E s.t. rlR(r , i) = X iff 
rhs(r) = fiX% \]/3\[ = i - 1 
Figure 2 shows the correspondence between a 
simple grammar fragment and its concatena- 
tion/or graph. 
Each tree can be represented by folding it 
onto the concatenation/or graph representing 
the grammar it was derived with, or, in other 
words, by appropriately annotating the graph 
itself. If N is the set of nodes of a parse tree 
obtained using grammar G, the corresponding 
folded tree is a partial function f 
f:NxN---~RxAf + 
such that f(n,n') = (r, k) implies that node n 
was expanded using rule r, and that node n' is 
its k th daughter in the tree (Fig.3). In the fol- 
lowing, we will use the inverse image of (r, k) un- 
der f, which we denote ¢(r, k) : f-l((r, k)) : 
{(n,n')lf(n,n' ) : (r,k)} C g × N. This can in 
turn be seen as a partial function 
¢ : R × Jkf+ ~ 2 N×N 
Disallowing the expansion of the k th element 
in the right-hand side of rule r by means of rule 
r' (assuming symbols match, i.e., (r/R(r, k), r') E 
rl2 ) results in suppressing a tree where: 
3n, n', n" E N, k' E .hf + \[<n, e ¢(r, k) A <n', n"> • ¢(r', k')\] 
This check can be performed very efficiently 
once the tree is represented by the ¢ function, 
i.e., once it is folded, as all this requires is to 
compare the entries for (r, k) and (r', k') 2 with a 
procedure linear in the size of the entries them- 
selves. If we used a more traditional represen- 
tation, the same check would require traversing 
2In fact, it suffices to check the entries for (r', 1). 
A 
• nO 
nl ~ B 
• n2 • 
c ~ n4 
• f n3 
c On5 
• n7 • n8 f e 
¢(rl, 1) = {<no,n1), <rt4,rt5>} 
¢(rl, 2) = {<no, n2), (n4, n6)} 
¢(r2, 1) = 0 
¢(r2, 2) = 0 
¢(r3, 1) = {(n2, n3)} 
¢(r3, 2) = (<n2, n4>} 
¢(r4, I) = {<n6, nT>} 
¢(r4, 2) -~ {(n6, n8) } 
A 
r 1 --% ~ r2 
O^'~ " O 
C x\ e. ~ ~;, .e 
r ~ ~o" 
~;! '~r4 
Figure 3: A tree and its folded representation. 
the whole tree. The worst-case complexity is 
still linear in the size of the tree, but in prac- 
tice, the number of nodes expanded using any 
given rule is much smaller than the total num- 
ber of nodes. 
Whenever a specialization is performed, all 
folded trees that are no longer licensed are 
removed; the concatenation/or graph for the 
grammar is updated to reflect the introduction 
and the elimination of rules; and the annota- 
tions on the affected edges are appropriately re- 
combined and distributed. If the performed spe- 
cialization is X C_ DRU(r, k) ~ {r}, 3 then the 
concatenation/or graph is updated as follows 
= nux\{~) 
~r = ~r U {(lhs(r),~)l~ E X} \ {(lhs(r),r)} 
~(r", i) = { VR(r",i), r" ¢ X 
~\]R(r,i), r u E X,i < k 
= VR(r',i -- k + 1), r" = aA(r, k, r') E X, 
k<i<k+m-1, 
~R(r, i -- m + 1), r" = RA(r, k, r') E X, 
i>k+m-1, 
where m = arity(r') = Irhs(r')l is the number 
of right-hand-side symbols of rule r ~. For each 
tree that is not eliminated as a consequence of 
3If X = DRU(r, k) = {r}, then no update is needed. 
the specialization we have 
~(~",~) = 
¢(r",i), 
if r" ¢~ X, r" ¢ r' 
¢(~",/) k (<n',n">13n\[<n,n'> e ¢(~, k)\]), 
if r" = r ~ 
{(n, n">13n', n", k'\[<n, n'> ~ ¢(~, k) 
A(n',n") E ¢(r',k') A (n,n'") E ¢(r,i)\]} 
if r" = RA(r,k,r ~) E X,i < k 
{(n, n")13n'\[(n, n') E ¢(r, k)A 
(n', n") E ¢(r', i -- k + 1)\]} 
ifr"=RA(r,k,r ~) EX, k<i<k+m-1, 
{(n,n")13n',n",k'\[(n,n' ) E ¢(r,k)A 
<n', n"> e ¢(~', k')A 
(n, n'") E ¢(r, i - m + 1)\]} 
if r" = RA(r, k, r ~) E X, i > k + m - 1, 
where again m = arity(r'). These updates can 
be implemented efficiently, requiring neither a 
traversal of the tree nor of the grammar. 
5 Experimental Results 
We specialized a broad-coverage LFG grammar 
for French on a corpus of technical documen- 
tation using the method described above. The 
treebank consisted of 960 sentences which were 
all known to be covered by the original gram- 
mar. For each sentence, all the trees returned by 
10 
R -- {rl, r2, r3, r4} s.t. 
rl: A-~cB 
r2: A --+ e A 
ra: B-+fA 
r4: B-arE 
~2 = {(A, rl), (A, r2}, {B, r3), (B, r4)} 
C 
fiR: (rl, 1) ~ c 
(rl, 2) ~ B 
(r2,1) -+ e 
(r2, 2) ~ A 
(r3, 1) ~ f 
(r3, 2) ~ A 
(r4,1) I 
@4, 2) --~ e 
A 
rl r 2 ® ® 
• • / • e / 
r 3 \/r4 
f 
Figure 2: A tiny grammar and the correspond- 
ing concatenation/or graph. 
the original grammar were available, together 
with a manually assigned indication of which 
was the correct one. The environment used, 
the Xerox Linguistic Environment (Kaplan and 
Maxwell, 1996) implements a version of opti- 
mality theory where parses are assigned "opti- 
mality marks" based on a number of criteria, 
and are ranked according to these marks. The 
set of parses with the best marks are called the 
optimal parses for a sentence. The correct parse 
was also an optimal parse for 913 out of 960 sen- 
tences. Given this, the specialization was aimed 
at reducing the number of optimal parses per 
sentence. 
We ran a series of ten-fold cross-validation ex- 
periments; the results are summarized in the ta- 
ble in Fig.4. The first line contains values for 
the original grammar. The second line contains 
measures for the first-order pruning grammar, 
i.e., the grammar with all and only those rules 
actually used in correct parses in the training 
set, with no combination inhibited. Lines 3 and 
4 list results for fully specialized grammars. Re- 
sults in the third line were obtained with a value 
for ~corr equal to 15 times the value of Ainc in 
the objective function: in other words, during 
training we were willing to lose a correct parse 
only if at least 15 incorrect parses were canceled 
as well. Results in the fourth line were obtained 
when this ratio was reduced to 10. The average 
number of parses per sentence is reported in the 
first column, whereas the second lists the av- 
erage number of optimal parses. Coverage was 
measured as the fraction of sentences which still 
receive the correct parse with the specialized 
grammar. To assess the trade off between cover- 
age and ambiguity reduction, we computed the 
F-score 4 considering only optimal parses when 
computing precision. This measure should not 
be confused with the F-score on labelled brack- 
eting reported for many stochastic parsers; here 
precision and recall concern perfect matching of 
whole trees. Recall is the same as coverage: the 
ratio between the number of correct parses pro- 
duced by the specialized grammar and the to- 
tal number of correct parses (equalling the total 
number of sentences in the test set). Precision is 
the ratio between the number of correct parses 
produced by the specialized grammar and the 
total number of parses produced by the same 
grammar. The fourth column lists values for the 
F-score when equal weight is given to precision 
and recall. Intuitively, however, in many cases 
missing the correct parse is more of a problem 
than returning spurious parses, so we also com- 
puted the F-score with a much larger emphasis 
on recall, i.e., with a = 0.1. The corresponding 
values are listed in the last column. 
The average number of parses per sentence, 
both optimal and non-optimal, decreases signif- 
icantly as more and more aggressive specializa- 
tion.is carried out, and consequently, more cov- 
erage is lost. The most aggressive form of spe- 
4The F-score is the harmonic mean of recall and pre- 
cision, where precision is weighted a and recall 1 - a. 
11 
Avg.p/s Avg. o.p./s. Coverage (%) F(a-- 0.5) F(c~-- 0.1) 
orig. 1941 4.69 100 35.15 73.05 
f.o.pruning 184 3.38 89 40.64 71.89 
Acorr=lh)~inc 82 2.23 86 53.25 76.58 
~corr----lO)kin c 63 2.03 82.5 54.46 74.80 
Figure 4: Results of the 
cialization gives the highest F-score for c~ = 0.5, 
whereas somewhat more conservative parame- 
ter settings lead to a better F-score when re- 
call is valued more. A speedup of a factor 4 
is achieved already by first-order pruning and 
remains approximately the same after further 
specialization. 
6 Conclusions 
Broad-coverage grammars tend to be highly am- 
biguous, which may constitute a serious prob- 
lem when using them for natural- pro- 
cessing. Corpus-independent compilation tech- 
niques, although useful for increasing efficiency, 
do little in terms of reducing ambiguity. 
In this paper we proposed a corpus-based 
technique for specializing a grammar on a do- 
main for which a treebank exists containing all 
trees returned for each sentence. This tech- 
nique, which builds extensively on previous 
work on explanation-based learning for NLP, 
consists in casting the problem as an optimiza- 
tion problem in the space of all possible spe- 
cializations of the original grammar. As initial 
candidate grammar, the first-order pruning of 
the original grammar is considered. Candidate 
successor grammars are obtained through the 
downward rule unfolding and specialization op- 
erator, that has the desirable property of never 
causing previously unseen parses to become 
available for sentences in the training set. Can- 
didate grammars are then assessed according to 
an objecting function combining grammar am- 
biguity and coverage, adapted to avoid overfit- 
ting. In order to ensure efficient computability 
of the objective function, the treebank is pre- 
viously folded onto the grammar itself. Exper- 
imental results using a broad-coverage lexical- 
functional grammar of French show that the 
technique allows effectively trading coverage for 
ambiguity reduction. Moreover, the parameters 
of the objective function can be used to control 
the trade off. 
specialization experiments. 
Acknowledgements 
We would like to thank the members of the 
MLTT group at the Xerox Research Centre Eu- 
rope in Grenoble, France, and the three anony- 
mous reviewers for valuable discussions and 
comments. This research was funded by the Eu- 
ropean TMR network Learning Computational 
Grammars. 

References 
Hiyan Alshawi, editor. 1992. The Core Language 
Engine. MIT Press. 
M. Butt, T.H. King, M.E. Nifio, and F. Segond. 
1999. A Grammar Writer's Cookbook. CSLI Pub- 
lications, Stanford, CA. 
Nicola Cancedda and Christer Samuelsson. 2000. 
Experiments with corpus-based lfg specialization. 
In Proceedings o/ the NAACL-ANLP 2000 Con- 
ference, Seattle, WA. 
Ronald Kaplan and John T. Maxwell. 1996. 
LFG grammar writer's workbench. Technical 
report, Xerox PARC. Available on-line as 
ftp://ftp.parc.xerox.com/pub/lfg/lfgmanual.ps. 
Manny Rayner and David Carter. 1996. Fast pars- 
ing using pruning and grammar specialization. In 
Proceedings of the ACL-96, Santa Cruz, CA. 
Manny Rayner. 1988. Applying explanation-based 
generalization to natural- processing. 
In Proceedings o/ the International Con/erence 
on Fifth Generation Computer Systems, Tokyo, 
Japan. 
Christer Samuelsson and Manny Rayner. 1991. 
Quantitative evaluation of explanation-based 
learning as an optimization tool for a large-scale 
natural  system. In Proceedings o/the 
IJCAI-91, Sydney, Australia. 
Christer Samuelsson. 1994. Grammar specialization 
through entropy thresholds. In Proceedings of the 
ACL-94, Las Cruces, New Mexico. Available as 
cmp-lg/9405022. 
Khalil Sima'an. 1999. Learning Efficient Dis- 
ambiguation. Ph.D. thesis, Institute for Logic, 
Language and Computation, Amsterdam, The 
Netherlands. 
