Grammatical Role Labeling with Integer Linear Programming
Manfred Klenner
Institute of Computational Linguistics
University of Zurich
klenner@cl.unizh.ch
Abstract
In this paper, we present a formalization
of grammatical role labeling within the
framework of Integer Linear Programming
(ILP). We focus on the integration of sub-
categorization information into the deci-
sion making process. We present a first
empirical evaluation that achieves compet-
itive precision and recall rates.
1 Introduction
An often stressed point is that the most widely
used classifiers such as Naive Bayes, HMM, and
Memory-based Learners are restricted to local de-
cisions only. With grammatical role labeling, for
example, there is no way to explicitly express
global constraints that, say, the verb “to give” must
have 3 arguments of a particular grammatical role.
Among the approaches to overcome this restric-
tion, i.e. that allow for global, theory based con-
straints, Integer Linear Programming (ILP) has
been applied to NLP (Punyakanok et al., 2004).
We apply ILP to the problem of grammatical
relation labeling, i.e. given two chunks¹ (e.g. a
verb and an np), what is the grammatical relation
between them (if there is any)? We have trained a
maximum entropy classifier on vectors with
morphological, syntactic and positional information.
Its output is utilized as weights to the ILP
component, which generates equations to solve the
following problem: Given subcategorization frames
(expressed in functional roles, e.g. subject), and
given a sentence with verbs, $V$ (auxiliary, modal,
finite, non-finite, ..), and chunks, $C$ ($np$, $pp$),
label all pairs in $(V \cup C) \times (V \cup C)$ with a
grammatical role².
¹ Currently, we use perfect chunks, that is, chunks
stemming from automatically flattening a treebank.
² Most of these pairs do not stand in a proper
grammatical relation; they get a null class assignment.
In this paper, we are pursuing two empirical
scenarios. The first is to collapse all subcategorization
frames of a verb into a single one, comprising
all subcategorized roles of the verb but not nec-
essarily forming a valid subcategorization frame
of that verb at all. For example, the verb “to be-
lieve” subcategorizes for a subject and a preposi-
tional complement (“He believes in magic”) or for
a subject and a clausal complement (“She believes
that he is dreaming”), but there is no frame that
combines a subject, a prepositional object and a
clausal object. Nevertheless, the set of valid gram-
matical roles of a verb can serve as a filter operat-
ing upon the output of a statistical classifier. The
typical errors being made by classifiers with only
local decisions are: a constituent is assigned to a
grammatical role more than once and a grammat-
ical role (e.g. of a verb) is instantiated more than
once. The worst example in our tests was a verb
that receives from the maxent classifier two sub-
jects and three clausal objects. Here, such a role
filter will help to improve the results.
The second setting is to provide ILP with the
correct subcategorization frame of the verb. The
results of such an oracle setting define the upper
bound of the performance our ILP approach can
achieve. Future work will be to let ILP find the
optimal subcategorization frame given all frames
of a verb.
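The role filter of the first scenario can be sketched in a few lines. The frame lexicon, the role names and the function names below are illustrative assumptions, not part of our system; the sketch only shows how collapsing the frames of “to believe” yields a role set that is not itself a valid frame, and how that set prunes classifier output.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical frame lexicon (assumption): each verb maps to its
# subcategorization frames, each frame being a set of role names.
FRAMES: Dict[str, List[FrozenSet[str]]] = {
    "believe": [
        frozenset({"subject", "prepositional_complement"}),
        frozenset({"subject", "clausal_complement"}),
    ],
}


def collapse_frames(frames: List[FrozenSet[str]]) -> Set[str]:
    """Union of the roles of all frames (scenario one)."""
    roles: Set[str] = set()
    for frame in frames:
        roles |= frame
    return roles


def filter_roles(verb: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Drop classifier scores for roles the verb never subcategorizes for."""
    licensed = collapse_frames(FRAMES[verb])
    return {role: s for role, s in scores.items() if role in licensed}


collapsed = collapse_frames(FRAMES["believe"])
# The collapsed set holds three roles, but no single frame of the verb does.
is_valid_frame = frozenset(collapsed) in FRAMES["believe"]
```

The filter can only rule out roles a verb never takes; it cannot tell which combination of the remaining roles forms an actual frame, which is what the second (oracle) scenario adds.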
2 The ILP Specification
Integer Linear Programming (ILP) is the name of
a class of constraint satisfaction algorithms which
are restricted to a numerical representation of the
problem to be solved. The objective is to optimize
(minimize or maximize) the numerical solution of
linear equations (see the objective function in Fig.
1). The general form of an ILP specification is
given in Fig. 1 (here: maximization). The goal is
to maximize an $n$-ary function $f$, which is defined
as the sum of the weighted variables $y_i x_i$.
Objective Function:
$$\max: f(x_1, \ldots, x_n) := y_1 x_1 + \ldots + y_n x_n$$
Constraints:
$$a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{in} x_n \; \{\le, =, \ge\} \; b_i, \quad i = 1, \ldots, m$$
$x_i$ are variables, $y_i$, $b_i$ and $a_{ij}$ are constants.
Figure 1: ILP Specification
Assignment decisions (e.g. grammatical role
labeling) can be modeled in the following way: $x_n$
are binary class variables that indicate the (non-)
assignment of a constituent $c_i$ to the grammatical
function $G_j$ (e.g. subject) of a verb $v_k$. To
represent this, three indices are needed. Thus, $x$ is
a complex variable name, e.g. $G_{ijk}$. For the sake
of readability, we add some mnemotechnical sugar
and use $G_i v_j c_k$ instead, or $S v_j c_k$ for a constituent
$c_k$ being (or not) the subject $S$ of verb $v_j$ ($S$ thus
is an instantiation of $G_i$). If the value of such
a class variable $G_i v_j c_k$ is set to 1 in the course
of the maximization task, the attachment was
successful, otherwise ($G_i v_j c_k = 0$) it failed. $y_i$ from
Fig. 1 are weights that represent the impact of an
assignment (or a constraint); they provide an
empirically based numerical justification of the
assignment (we don't need the $a_{ij}$). For example,
we represent the impact of $G_i v_j c_k = 1$ by $w_{G_i v_j c_k}$.
These weights are derived from a maximum
entropy model trained on a treebank (see section 5).
$b_i$ is used to set up numerical constraints, for
example that a constituent can only be the filler of
one grammatical role. The decision which of the
class variables are to be “on” or “off” is based on
the weights and the constraints an overall solution
must obey. ILP seeks to optimize the solution.
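To make the general form concrete, the following minimal sketch solves a tiny binary program by exhaustive enumeration. The enumeration merely stands in for a real ILP solver and is not the solver we use; the function names, the constraint encoding and the three weights are assumptions made for illustration.

```python
from itertools import product
from typing import Optional, Sequence, Tuple

# A constraint is (coefficients a_i1..a_in, operator, bound b_i).
Constraint = Tuple[Sequence[float], str, float]


def holds(coeffs: Sequence[float], op: str, bound: float,
          x: Sequence[int]) -> bool:
    """Check one linear constraint against a 0/1 assignment x."""
    lhs = sum(a * xi for a, xi in zip(coeffs, x))
    if op == "<=":
        return lhs <= bound
    if op == ">=":
        return lhs >= bound
    if op == "==":
        return lhs == bound
    raise ValueError(f"unknown operator: {op}")


def solve_binary_ilp(weights: Sequence[float],
                     constraints: Sequence[Constraint]
                     ) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Maximize sum(y_i * x_i) over all x in {0,1}^n that satisfy
    every constraint; returns None if no assignment is feasible."""
    best = None
    for x in product((0, 1), repeat=len(weights)):
        if not all(holds(a, op, b, x) for a, op, b in constraints):
            continue
        value = sum(y * xi for y, xi in zip(weights, x))
        if best is None or value > best[0]:
            best = (value, x)
    return best


# Hypothetical weights y_1..y_3; the constraint x_1 + x_2 <= 1 says that
# the first two decisions exclude each other.
result = solve_binary_ilp([0.6, 0.3, 0.5], [([1, 1, 0], "<=", 1)])
```

Here the first and third variable are switched on: the second competes with the first for the same slot and carries the lower weight. Enumeration grows exponentially in the number of variables, so it is only suitable for toy instances.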
3 Formalization
We restrict our formalization to the following set
of grammatical functions: subject ($S$), direct (i.e.
accusative) object ($D$), indirect (i.e. dative) object
($I$), clausal complement ($C$), prepositional
complement ($P$), attributive (np or pp) attachment ($T$)
and adjunct ($A$). The set of grammatical relations
of a verb (verb complements) is denoted with $G$; it
comprises $S$, $D$, $I$, $C$ and $P$.
The objective function is:
$$\max: A + T + U + V \qquad (1)$$
$A$ represents the weighted sum of all adjunct
attachments. $T$ is the weighted sum of all attributive
PP (“the book in her hand ..”) and genitive NP
attachments (“die Frau des$_{Gen}$ Professors$_{Gen}$” [the
wife of the professor]). $U$ represents the weighted
sum of all unassigned objects.³ $V$ is the weighted
sum of the case frame instantiations of all verbs in
the sentence. It is defined as follows:
$$V = \sum_{i}^{|verbs|} \sum_{G \in R_{v_i}} \sum_{j}^{|consts^{+}|} w_{G v_i c_j} \cdot G v_i c_j \qquad (2)$$
This sums up over all verbs. For each verb,
each grammatical role ($R_{v_i}$ is the set of such
roles) is instantiated from the stock of all
constituents ($consts^{+}$, which includes all np and pp
constituents but also the verbs as potential heads
of clausal objects). $G v_i c_j$ is a variable that
indicates the assignment of a constituent $c_j$ to the
grammatical function $G$ of verb $v_i$. $w_{G v_i c_j}$ is the
weight of such an assignment. The (binary) value
of each $G v_i c_j$ is to be determined in the course
of the constraint satisfaction process, the weight is
taken from the maximum entropy model.
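As a worked instance of equation 2, assume (purely for illustration) a sentence with a single verb $v_1$ whose role set contains only subject and direct object, written $S$ and $D$ here, and two candidate fillers $c_1$ and $c_2$; the weight symbol $w$ is likewise shorthand. The triple sum then unfolds into four weighted binary variables:

$$V = w_{S v_1 c_1} \cdot S v_1 c_1 + w_{S v_1 c_2} \cdot S v_1 c_2 + w_{D v_1 c_1} \cdot D v_1 c_1 + w_{D v_1 c_2} \cdot D v_1 c_2$$

With hypothetical weights 0.8, 0.3, 0.1 and 0.6 in that order, switching on $S v_1 c_1$ and $D v_1 c_2$ contributes $0.8 + 0.6 = 1.4$ to the objective, whereas switching on all four variables would contribute $1.8$. This is why the constraints of section 4 are needed.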
$T$ is the function for weighted attributive
attachments:
$$T = \sum_{i}^{|consts|} \sum_{j\,(i \neq j)}^{|consts|} w_{T c_i c_j} \cdot T c_i c_j \qquad (3)$$
where $w_{T c_i c_j}$ is the weight of an assignment
of constituent $c_j$ to constituent $c_i$ and $T c_i c_j$ is a
binary variable indicating the classification
decision whether $c_j$ actually modifies $c_i$. In contrast to
$consts^{+}$, $consts$ does not include verbs.
The function for weighted adjunct attachments,
$A$, is:
$$A = \sum_{j}^{|consts^{pp}|} \sum_{i}^{|verbs|} w_{A v_i c_j} \cdot A v_i c_j \qquad (4)$$
where $consts^{pp}$ is the set of PP constituents of
the sentence. $w_{A v_i c_j}$ is the weight given to a
classification of a PP as an adjunct of a clause with $v_i$
as verbal head.
The function for the weighted assignment to the
null class, $U$, is:
$$U = \sum_{i}^{|consts^{+}|} w_{c_i} \cdot U c_i \qquad (5)$$
This represents the impact of assigning a
constituent neither to a verb (as a complement) nor
to another constituent (as an attributive modifier).
$U c_i = 1$ means that the constituent $c_i$ has got no
head (e.g. a finite verb as part of a sentential
coordination), although it might be the head of other
$c_j$.
³ Not every set of chunks can form a valid dependency
tree; $U$ introduces robustness.
The equations from 1 to 5 are devoted to the
maximization task, i.e. which constituent is at-
tached to which grammatical function and with
which impact. Of course, without any further re-
strictions, every constituent would get assigned to
every grammatical role - because there are no co-
occurrence restrictions. Exactly this would lead to
a maximal sum. In order to assure a valid distribu-
tion, restrictions have to be formulated, e.g. that a
grammatical role can have at most one filler object
and that a constituent can be at most the filler of
one grammatical role.
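The degenerate behaviour of the unconstrained objective is easy to reproduce. The sketch below uses four hypothetical weights for one verb, two roles (labelled S and D for this example) and two constituents, and compares the unconstrained maximum with the maximum under the two restrictions just mentioned. Exhaustive search replaces the ILP solver here, and all names are assumptions.

```python
from itertools import product

# Hypothetical weights for one verb v1, two roles and two constituents.
weights = {
    ("S", "v1", "c1"): 0.8,
    ("S", "v1", "c2"): 0.3,
    ("D", "v1", "c1"): 0.1,
    ("D", "v1", "c2"): 0.6,
}
keys = list(weights)


def maximize(is_valid):
    """Return the set of switched-on variables with the highest
    weighted sum among all assignments accepted by is_valid."""
    best_score, best_on = -1.0, set()
    for bits in product((0, 1), repeat=len(keys)):
        on = {k for k, b in zip(keys, bits) if b}
        if not is_valid(on):
            continue
        score = sum(weights[k] for k in on)
        if score > best_score:
            best_score, best_on = score, on
    return best_on


def at_most_once(on):
    """Each role has at most one filler, each filler at most one role."""
    roles = [k[0] for k in on]
    fillers = [k[2] for k in on]
    return len(roles) == len(set(roles)) and len(fillers) == len(set(fillers))


unconstrained = maximize(lambda on: True)   # every variable is switched on
constrained = maximize(at_most_once)        # one filler per role
```

Without restrictions all four variables are switched on, since every weight is positive; with them, only the best compatible pair of assignments survives.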
4 Constraints
A constituent $c_j$ must either be bound as an
attribute, an adjunct, a verb complement or by the
null class. This is to say that all class variables
with $c_j$ sum up to exactly 1; $c_j$ then is consumed.
$$U c_j + \sum_{i} \sum_{G} G v_i c_j + \sum_{i} T c_i c_j + \sum_{i} A v_i c_j = 1, \quad \forall j \qquad (6)$$
Here, $j$ is an index over all constituents and $G$ is
one of the grammatical roles of verb $v_i$ ($G \in R_{v_i}$).
No two constituents can be attached to each
other symmetrically (being head and modifier of
each other at the same time), i.e. $T$ (among
others) is defined to be asymmetric.
$$T c_i c_j + T c_j c_i \le 1, \quad \forall i, j \qquad (7)$$
Finally, we must restrict the number of filler
objects a grammatical role can have. Here, we
have to distinguish among our two settings. In
setting one (all case roles of all frames of a verb
are collapsed into a single set of case roles), we
can’t require all grammatical roles to be instanti-
ated (since we have an artificial case frame, not
necessarily a proper one). This is expressed as $\le 1$
in equation 8.
$$\sum_{j}^{|consts^{+}|} G v_i c_j \le 1, \quad \forall i,\ G \in R_{v_i} \qquad (8)$$
In setting two (the actual case frame is given),
we require that every grammatical role $G$ of the
verb $v_i$ ($G \in R_{v_i}$) must be instantiated exactly
once:
$$\sum_{j}^{|consts^{+}|} G v_i c_j = 1, \quad \forall i,\ G \in R_{v_i} \qquad (9)$$
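The interplay of the objective function with equations 6 to 9 can be tried out on a toy sentence. The following sketch is an assumption-laden illustration, not our implementation: the chunk names, the weights and the single-letter labels (S subject, D direct object, T attribute, A adjunct, U null class) are chosen for the example, verbs are left out of the filler set for brevity, and exhaustive enumeration stands in for an ILP solver.

```python
from itertools import product

verbs = ["v1"]
consts = ["c1", "c2", "c3"]   # two np chunks and one pp chunk
consts_pp = ["c3"]            # only pp chunks may become adjuncts
roles = {"v1": ["S", "D"]}    # role set of the verb

# Hypothetical weights; in the paper they come from the maxent model.
w = {
    ("S", "v1", "c1"): 0.80, ("S", "v1", "c2"): 0.30, ("S", "v1", "c3"): 0.05,
    ("D", "v1", "c1"): 0.10, ("D", "v1", "c2"): 0.60, ("D", "v1", "c3"): 0.05,
    ("T", "c1", "c2"): 0.05, ("T", "c2", "c1"): 0.05, ("T", "c1", "c3"): 0.10,
    ("T", "c3", "c1"): 0.02, ("T", "c2", "c3"): 0.40, ("T", "c3", "c2"): 0.02,
    ("A", "v1", "c3"): 0.50,
    ("U", "c1"): 0.05, ("U", "c2"): 0.05, ("U", "c3"): 0.05,
}
variables = list(w)  # one binary variable per weight


def feasible(x, exactly_once):
    for cj in consts:  # equation 6: every constituent is consumed once
        total = x[("U", cj)]
        total += sum(x[(g, v, cj)] for v in verbs for g in roles[v])
        total += sum(x[("T", ci, cj)] for ci in consts if ci != cj)
        total += sum(x[("A", v, cj)] for v in verbs if cj in consts_pp)
        if total != 1:
            return False
    for ci, cj in product(consts, repeat=2):  # equation 7: T is asymmetric
        if ci != cj and x[("T", ci, cj)] + x[("T", cj, ci)] > 1:
            return False
    for v in verbs:  # equation 8 (<= 1) or equation 9 (== 1)
        for g in roles[v]:
            fillers = sum(x[(g, v, cj)] for cj in consts)
            if fillers > 1 or (exactly_once and fillers != 1):
                return False
    return True


def solve(exactly_once=False):
    """Return the switched-on variables of the best feasible assignment."""
    best_score, best_on = None, None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        if not feasible(x, exactly_once):
            continue
        score = sum(w[k] for k in variables if x[k])
        if best_score is None or score > best_score:
            best_score, best_on = score, {k for k in variables if x[k]}
    return best_on


labeling = solve()                        # collapsed-frame setting (eq. 8)
oracle_labeling = solve(exactly_once=True)  # correct-frame setting (eq. 9)
```

With these weights both settings select the same labeling (c1 as subject, c2 as direct object, c3 as adjunct), because the locally best choices happen to be compatible. The settings differ as soon as a role of the given frame would otherwise stay unfilled.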
5 The Weighting Scheme
A maximum entropy model was used to fix a
probability model that serves as the basis for the ILP
weights. The model was trained on the Tiger tree-
bank (Brants et al., 2002) with feature vectors
stemming from the following set of features: the
part of speech tags of the two candidate chunks,
the distance between them in phrases, the number
of verbs between them, the number of punctuation
marks between them, the person, case and num-
ber of the candidates, their heads, the direction of
the attachment (left or right) and a passive/active
voice flag.
The output of the maxent model is for each pair
of chunks (represented by their feature vectors) a
probability vector. Each entry in this probability
vector represents the probability (used as a weight)
that the two chunks are in a particular grammatical
relation (including the “non-grammatical
relation”, $NGR$). For example, the weight for an
adjunct assignment, $w_{A v_1 c_2}$, of two chunks $v_1$ (a
verb) and $c_2$ (an $np$ or a $pp$) is given by the
corresponding entry in the probability vector of the
maximum entropy model. The vector also pro-
vides values for a subject assignment of these two
chunks etc.
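How a probability vector turns into ILP weights can be sketched as follows. The label names, their order, the function name and the example probabilities are placeholders assumed for this illustration; only the idea that each entry of the vector becomes the weight of one class variable is taken from the text.

```python
from typing import Dict, Sequence, Tuple

# Label inventory in the (assumed) order of the classifier output.
LABELS = ("subject", "direct_object", "indirect_object",
          "clausal_complement", "prepositional_complement",
          "attribute", "adjunct", "no_relation")


def pair_weights(head: str, dependent: str,
                 probs: Sequence[float]) -> Dict[Tuple[str, str, str], float]:
    """Map one probability vector to weights keyed by (label, head, dependent)."""
    if len(probs) != len(LABELS):
        raise ValueError("one probability per label expected")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return {(label, head, dependent): p for label, p in zip(LABELS, probs)}


# Hypothetical classifier output for a verb chunk v1 and a chunk c2.
weights = pair_weights("v1", "c2",
                       [0.05, 0.05, 0.02, 0.03, 0.10, 0.20, 0.25, 0.30])
adjunct_weight = weights[("adjunct", "v1", "c2")]
```

Every candidate pair contributes one such dictionary; their union supplies the weights of the objective function.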
6 Empirical Results
The overall precision of the maximum entropy
classifier is 87.46%. Since candidate pairs are
generated almost without restrictions, most pairs
do not realize a proper grammatical relation. In
the training set these examples are labeled with
the non-grammatical relation label $NGR$ (which
is the basis of ILP's null class $U$). Since maximum
entropy modeling seeks to sharpen the classifier
with respect to the most prominent class, $NGR$
gets a strong bias. Things therefore get worse if
we focus on the proper grammatical relations. The
precision then is low, namely 62.73%, the recall is
85.76%, the f-measure is 72.46%. ILP improves
the precision by almost 20% (in the “all frames in
one” setting the precision is 79.99% and the
f-measure is 81.31%).
We trained on 40,000 sentences, which gives
about 700,000 vectors (90% training, 10% test, in-
cluding negative and positive pairings). Our first
experiment was devoted to fixing an upper bound for
the ILP approach: we selected from the set of
subcategorization frames of a verb the correct one
(according to the gold standard). The set of licensed
grammatical relations then is reduced to the
correct subcategorized GR and the non-governable
GR $A$ (adjunct) and $T$ (attribute). The results are
given in Fig. 2 under $F_{corr}$ (cf. section 3 for GR
shortcuts, e.g. $S$ for subject).
          F_corr                 F_coll
     Prec   Rec  F-Mea     Prec   Rec  F-Mea
S    91.4  86.1   88.7     89.8  85.7   87.7
D    90.4  83.3   86.7     78.6  79.7   79.1
I    88.5  76.9   82.3     73.5  62.1   67.3
P    79.3  73.7   76.4     75.6  43.6   55.9
C    98.6  94.1   96.3     82.9  96.6   89.3
A    76.7  75.6   76.1     74.2  78.9   76.5
T    75.7  76.9   76.3     73.6  79.9   76.7
Figure 2: Correct Frame and Collapsed Frames
The results of the governable GR ($S$ down to
$C$) are quite good, only the results for prepositional
complements ($P$) are low (the f-measure is
76.4%). From the 36509 grammatical relations,
37173 were found and 31680 were correct. Over-
all precision is 85.23%, recall is 86.77% and the
f-measure is 85.99%. The most dominant error
being made here is the coherent but wrong assign-
ment of constituents to grammatical roles (e.g. the
subject is taken to be object). This is not a prob-
lem with ILP or the subcategorization frames, but
one of the statistical model (and the feature vec-
tors). It does not discriminate well among alter-
natives. Any improvement of the statistical model
will push the precision of ILP.
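The overall figures follow from the three counts above by the usual definitions of precision, recall and f-measure. The short sketch below recomputes them; the function name is ours, and the results agree with the reported percentages up to rounding in the last digit.

```python
from typing import Tuple


def precision_recall_f(gold: int, found: int,
                       correct: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced f-measure from raw counts."""
    precision = correct / found
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts reported for the correct-frame setting.
p, r, f = precision_recall_f(gold=36509, found=37173, correct=31680)
print(f"precision {100 * p:.2f}%  recall {100 * r:.2f}%  f-measure {100 * f:.2f}%")
```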
The results of the second experiment, i.e. collapsing
all grammatical roles of the verb frames into a
single role set (cf. Fig. 2, $F_{coll}$), are astonishingly
good. The f-measure comes close to the results
of (Buchholz, 1999). Overall precision is 79.99%,
recall 82.67% and f-measure is 81.31%. As ex-
pected, the values of the governable GR decrease
(e.g. recall for prepositional objects by 30.1%).
The third setting will be to let ILP choose
among all subcategorization frames of a verb
(there are up to 20 frames per verb). First
experiments have shown that the results are between the
$F_{corr}$ and $F_{coll}$ results. The question then is how
close we can come to the $F_{corr}$ upper bound.
7 Related Work
ILP has been applied to various NLP problems,
including semantic role labeling (Punyakanok et
al., 2004), extraction of predicates from parse trees
(Klenner, 2005) and discourse ordering in genera-
tion (Althaus et al., 2004). (Roth and Yih, 2005)
discuss how to utilize ILP with Conditional Ran-
dom Fields.
Grammatical relation labeling has been addressed
in a couple of articles, e.g. (Buchholz,
1999). There, a cascaded model (of classifiers)
has been proposed (using various tools around
TIMBL). The f-measure (perfect test data) was
83.5%. However, the set of grammatical relations
differs from the one we use, which makes it diffi-
cult to compare the results.
8 Conclusion and Future Work
In this paper, we argue for the integration of top
down (theory based) information into NLP. One
kind of information that is well known but has
been used only in a data driven manner within
statistical approaches (e.g. the Collins parser) is
subcategorization information (or case frames). If
subcategorization information turns out to be use-
ful at all, it might become so only under the strict
control of a global constraint mechanism. We are
currently testing an ILP formalization where all
subcategorization frames of a verb are competing
with each other. The benefit will be to have the
instantiation not only of licensed grammatical roles
of a verb, but of a consistent and coherent
instantiation of a single case frame.
Acknowledgment. I would like to thank Markus Dreyer
for fruitful (“long distance”) discussions and a number of
(steadily improved) maximum entropy models. Also, the de-
tailed comments of the reviewers have been very helpful.
References
Ernst Althaus, Nikiforos Karamanis, and Alexander Koller.
2004. Computing Locally Coherent Discourses.
Proceedings of the ACL.
Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang
Lezius and George Smith. 2002. The TIGER Treebank.
Proceedings of the Workshop on Treebanks and Linguistic
Theories.
Sabine Buchholz, Jorn Veenstra and Walter Daelemans.
1999. Cascaded Grammatical Relation Assignment.
EMNLP-VLC’99, the Joint SIGDAT Conference on Em-
pirical Methods in NLP and Very Large Corpora.
Manfred Klenner. 2005. Extracting Predicate Structures
from Parse Trees. Proceedings of the RANLP 2005.
Vasin Punyakanok, Dan Roth, Wen-tau Yih, and Dave Zi-
mak. 2004. Role Labeling via Integer Linear Program-
ming Inference. Proceedings of the 20th COLING.
Dan Roth and Wen-tau Yih. 2005. ILP Inference for
Conditional Random Fields. Proceedings of the ICML.
