The University of Amsterdam at Senseval-3:
Semantic Roles and Logic Forms
David Ahn Sisay Fissaha Valentin Jijkoun Maarten de Rijke
Informatics Institute, University of Amsterdam
Kruislaan 403
1098 SJ Amsterdam
The Netherlands
{ahn,sfissaha,jijkoun,mdr}@science.uva.nl
Abstract
We describe our participation in two of the tasks or-
ganized within Senseval-3: Automatic Labeling of
Semantic Roles and Identification of Logic Forms
in English.
1 Introduction
This year (2004), Senseval, a well-established fo-
rum for the evaluation and comparison of word
sense disambiguation (WSD) systems, introduced
two tasks aimed at building semantic representa-
tions of natural language sentences. One task, Auto-
matic Labeling of Semantic Roles (SR), takes as its
theoretical foundation Frame Semantics (Fillmore,
1977) and uses FrameNet (Johnson et al., 2003) as
a data resource for evaluation and system develop-
ment. The definition of the task is simple: given
a natural language sentence and a target word in
the sentence, find other fragments (continuous word
sequences) of the sentence that correspond to ele-
ments of the semantic frame, that is, that serve as
arguments of the predicate introduced by the target
word.
For this task, the systems receive a sentence, a
target word, and a semantic frame (one target word
may belong to multiple frames; hence, for real-
world applications, a preliminary WSD step might
be needed to select an appropriate frame). The out-
put of a system is a list of frame elements, with their
names and character positions in the sentence. The
evaluation of the SR task is based on precision and
recall. For this year’s task, the organizers chose 40
frames from FrameNet 1.1, with 32,560 annotated
sentences, 8,002 of which formed the test set.
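As a constructed illustration (the frame, element names, and offsets here are ours, not taken from the task data): for the sentence below with target arrived and a frame such as Arriving, a system should return something like

  Sentence: She arrived at the station at noon.
  Output:   Theme = "She" (0-2), Goal = "at the station" (12-25),
            Time = "at noon" (27-33)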
The second task, Identification of Logic Forms
in English (LF), is based on the LF formalism de-
scribed in (Rus, 2002). The LF formalism is a sim-
ple logical form language for natural language se-
mantics with only predicates and variables; there
is no quantification or negation, and atomic predi-
cations are implicitly conjoined. Predicates corre-
spond directly to words and are composed of the
base form of the word, the part of speech tag, and a
sense number (corresponding to the WordNet sense
of the word as used). For the task, the system is
given sentences and must produce LFs. Word sense
disambiguation is not part of the task, so the pred-
icates need not specify WordNet senses. System
evaluation is based on precision and recall of pred-
icates and predicates together with all their argu-
ments as compared to a gold standard.
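As a constructed illustration of the notation (in the style of (Rus, 2002), with the optional sense numbers omitted, as the task permits), the sentence John loves Mary would receive the LF

  John:n(x1) love:v(e1, x1, x2) Mary:n(x2)

where e1 is the event variable of the verbal predication and x1 and x2 are referential variables for the two noun phrases.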
2 Syntactic Processing
For both tasks, SR and LF, the core of our systems
was the syntactic analysis module described in de-
tail in (Jijkoun and de Rijke, 2004). We only have
space here to give a short overview of the module.
Every sentence was part-of-speech tagged using
a maximum entropy tagger (Ratnaparkhi, 1996) and
parsed using a state-of-the-art wide coverage phrase
structure parser (Collins, 1999). Both the tagger and
the parser are trained on the Penn Treebank Wall
Street Journal Corpus (WSJ in the rest of this paper)
and thus produce structures similar to those in the
Penn Treebank. Unfortunately, the parser does not
deliver some of the information available in WSJ
that is potentially useful for our two applications:
Penn functional tags (e.g., subject, temporal, closely
related, logical subject in passive) and non-local de-
pendencies (e.g., subject and object control, argu-
ment extraction in relative clauses). Our syntactic
module tries to compensate for this and make this
information explicit in the resulting syntactic analy-
ses.
As a first step, we converted phrase trees pro-
duced by the parser to dependency structures, by
detecting heads of constituents and then propagat-
ing the lexical head information up the syntactic
tree, similarly to (Collins, 1999). The resulting de-
pendency structures were labeled with dependency
labels derived from corresponding Penn phrase la-
bels: e.g., a verb phrase (VP) modified by a prepo-
sitional phrase (PP) resulted in a dependency with
label ‘VP|PP’.
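The following minimal sketch (ours, for illustration; the toy head rules stand in for the full head-finding tables of (Collins, 1999)) shows the shape of this conversion:

  def head_child(label, children):
      # Toy head rules; real head tables are far richer.
      priorities = {"S": ["VP"], "VP": ["VBD", "VB", "VP"], "NP": ["NN", "NNS"]}
      for cat in priorities.get(label, []):
          for child in children:
              if child[0] == cat:
                  return child
      return children[-1]

  def to_dependencies(tree, deps):
      # tree = (label, children) for phrases, (POS tag, word) for leaves.
      # Returns the lexical head word; appends (head, dependent, label).
      label, rest = tree
      if isinstance(rest, str):
          return rest
      head = head_child(label, rest)
      head_word = to_dependencies(head, deps)
      for child in rest:
          if child is not head:
              dep_word = to_dependencies(child, deps)
              deps.append((head_word, dep_word, label + "|" + child[0]))
      return head_word

  deps = []
  to_dependencies(("S", [("NP", [("NNS", "directors")]),
                         ("VP", [("VBD", "planned")])]), deps)
  print(deps)  # [('planned', 'directors', 'S|NP')]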
[Figure 1: Stages of the syntactic processing: (a) the parser’s output, (b) the result of conversion to a dependency structure, (c) final output of our syntactic module]
Then, the information available in the WSJ (functional tags, non-local dependencies) was added to dependency structures using Memory-Based Learning (Daelemans et al., 2003): we trained the learner to change dependency labels, or add new nodes or arcs to dependency structures. Trained and tested on WSJ, our system achieves state-of-the-art performance for recovery of Penn functional tags and non-local dependencies (Jijkoun and de Rijke, 2004).
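As an illustrative (made-up) instance: to recover the subject tag in Figure 1, the learner would be shown a dependency labeled ‘S|NP’ whose head is a past-tense verb (planned) and whose dependent is a plural noun (directors) preceding it, and would be trained to output the new label ‘S|NP-SBJ’.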
Figure 1 shows three stages of the syntactic anal-
ysis of the sentence Directors this month planned to
seek seats (a simplified actual sentence from WSJ):
(a) the phrase structure tree produced by Collins’
parser, (b) the phrase structure tree converted to a
dependency structure and (c) the transformed de-
pendency structure with added functional tags and a
non-local dependency—the final output of our syn-
tactic module. Dependencies are shown as arcs
from heads to dependents.
3 Automatic Labeling of Semantic Roles
For the SR task, we applied a method very similar to
the one used in (Jijkoun and de Rijke, 2004) for re-
covering syntactic structures and somewhat similar
to the first method for automatic semantic role iden-
tification described in (Gildea and Jurafsky, 2002).
Essentially, our method consists of extracting possi-
ble syntactic patterns (paths in syntactic dependency
structures), introducing semantic relations from a
training corpus, and then using a machine learn-
ing classifier to predict which syntactic paths cor-
respond to which frame elements.
Our main assumption was that frame elements,
as annotated in FrameNet, correspond directly to
constituents (constituents being complete subtrees
of dependency structures). Similarly to (Gildea and
Jurafsky, 2002), our own evaluation showed that
about 15% of frame elements in FrameNet 1.1 do
not correspond to constituents, even when applying
some straightforward heuristics (see below) to com-
pensate for this mismatch. This observation puts an
upper bound of around 85% on the accuracy of our
system (with strict evaluation, i.e., if frame element
boundaries must match the gold standard exactly).
Note, though, that these 15% of “erroneous” con-
stituents also include parsing errors.
Since the core of our SR system operates on
words, constituents, and dependencies, two im-
portant steps are the conversion of FrameNet el-
ements (continuous sequences of characters) into
head words of constituents, and vice versa. The con-
version of FrameNet elements is straightforward:
we take the head of a frame element to be the word
that dominates the most words of this element in
the dependency graph of the sentence. In the other
direction, when converting a subgraph of a depen-
dency graph dominated by a word w into a contin-
uous sequence of words, we take all (i.e., not only
immediate) dependents of w, ignoring non-local de-
pendencies, unless w is the target word of the sen-
tence, in which case we take the word w alone. This
latter heuristic helps us to handle cases when a noun
target word is a semantic argument of itself. Sev-
eral other simple heuristics were also found helpful:
e.g., if the result of the conversion of a constituent to
a word sequence contains the target word, we take
only the words to the right of the target word.
With this conversion between frame elements and
constituents, the rest of our system only needs to
operate on words and labeled dependencies.
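A minimal sketch of these two conversions, under our reading of the heuristics (the data structures are our own, for illustration; we assume word tokens are unique within a sentence):

  def head_of_element(element_words, dominates):
      # Head = the word dominating the most words of the element;
      # dominates[w] is the set of words in the subtree under w.
      return max(element_words,
                 key=lambda w: len(dominates[w] & set(element_words)))

  def span_of_constituent(w, dominates, target, sentence):
      # Project the subtree under w (non-local dependencies are
      # assumed excluded from 'dominates') back to a word sequence;
      # 'sentence' is the list of tokens in surface order.
      if w == target:                  # the target word stands alone
          return [w]
      words = sorted(dominates[w], key=sentence.index)
      if target in words:              # heuristic from the text above:
          words = words[words.index(target) + 1:]
      return words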
3.1 Training: the major steps
First, we extract from the training corpus
(dependency-parsed FrameNet sentences, with
words marked as targets and frame elements) all
shortest undirected paths in dependency graphs that
connect target words with their semantic arguments.
In this way, we collect all “interesting” syntactic
paths from the training corpus.
In the second step, for all extracted syntactic
paths and again for all training sentences, we extract
all occurrences of the paths (i.e., paths, starting from
a target word, that actually exist in the dependency
graph), recording for each such occurrence whether
it connects a target word to one of its semantic ar-
guments. For performance reasons, we consider for
each target word only syntactic paths extracted from
sentences annotated with respect to the same frame,
and we ignore all paths of length more than 3.
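For illustration, the path extraction can be sketched as a breadth-first search over the (undirected) dependency graph, cut off at three edges (the graph encoding is ours):

  from collections import deque

  def shortest_path(graph, source, goal, max_len=3):
      # graph[node]: list of (neighbor, label, direction) triples,
      # with direction in {'up', 'down'} marking arc orientation.
      queue = deque([(source, [])])
      seen = {source}
      while queue:
          node, path = queue.popleft()
          if node == goal:
              return path              # list of (label, direction) steps
          if len(path) >= max_len:
              continue
          for neighbor, label, direction in graph[node]:
              if neighbor not in seen:
                  seen.add(neighbor)
                  queue.append((neighbor, path + [(label, direction)]))
      return None                      # no path within max_len edges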
For every extracted occurrence, we record the
features describing the occurrence of a path in more
detail: the frame name, the path itself, the words
along the path (including the target word and the
possible head of a frame element—first and last
node of the path, respectively), their POS tags and
semantic classes. For nouns, the semantic class
of a word is defined as the hypernym of the first
sense of the noun in WordNet, one of 19 manu-
ally selected terms (animal, person, social group,
clothes, feeling, property, phenomenon, etc.). For
lexical adverbs and prepositions, the semantic class
is one of the 6 clusters obtained automatically using
the K-means clustering algorithm on data extracted
from FrameNet. Examples of the clusters are:
(abruptly, ironically, slowly, . . . ), (above, beneath,
inside, . . . ), (entirely, enough, better, . . . ). The list
of WordNet hypernyms and the number of clusters
were chosen experimentally. We also added features
describing the subcategorization frame of the tar-
get word; this information is straightforwardly ex-
tracted from the dependency graph. In total, the sys-
tem used 22 features.
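An illustrative (made-up) instance for the path from planned to directors in Figure 1 might include: the frame name; the path ‘S|NP-SBJ, downward’; the target word plan with POS VBD; the endpoint word director with POS NNS and semantic class person; the target’s subcategorization frame; and, as the class to predict, either a frame element name or a “no element” label.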
The set of path occurrences obtained in the sec-
ond step, with all the extracted features, is a pool of
positive and negative examples of whether certain
syntactic patterns correspond to any semantic argu-
ments. The pool is used as an instance base to train
TiMBL, a memory-based learner (Daelemans et al.,
2003), to predict whether the endpoint of a syntac-
tic path starting at a target word corresponds to a
semantic argument, and if so, what its name is.
We chose TiMBL for this task because we had
previously found that it deals successfully with
complex feature spaces and data sparseness (in our
case, in the presence of many lexical features) (Ji-
jkoun and de Rijke, 2004). Moreover, TiMBL is
very flexible and implements many variants of the
basic k-nearest neighbor algorithm. We found that
tuning various parameters (the number of neigh-
bors, weighting and voting schemes) made substan-
tial differences in the performance of our system.
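The flavor of this setup can be conveyed with a tiny sketch (illustration only; TiMBL itself provides many distance metrics, feature weightings, and voting schemes):

  def distance(a, b):
      # Overlap metric: count mismatching feature values.
      return sum(1 for x, y in zip(a, b) if x != y)

  def knn_classify(instances, query, k=5):
      # instances: list of (feature_tuple, label) pairs kept in memory.
      nearest = sorted(instances,
                       key=lambda inst: distance(inst[0], query))[:k]
      labels = [label for _, label in nearest]
      return max(set(labels), key=labels.count)   # majority vote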
3.2 Applying the system
Once the training is complete, the system can be
applied to new sentences (with the indicated target
word and its frame) as follows. A sentence is parsed
and its dependency structure is built, as described in
Section 2. All occurrences of “interesting” syntac-
tic paths are extracted, along with their features as
described above. The resulting feature vectors are
fed to TiMBL to determine whether the endpoints
of the syntactic paths correspond to semantic argu-
ments of the target word. For the path occurrences
classified positively, the constituents of their end-
points are converted to continuous word sequences,
as described earlier; in this case the system has de-
tected a frame element.
3.3 Results
During the development of our system, we used
only the 24,558 sentences from FrameNet set aside
for training by the SR task organizers. To tune the
system, this corpus was randomly split into training
and development sets (70% and 30%, resp.), evenly
for all target words. The official test set (8,002 sen-
tences) was used only once to produce the submitted
run, with the whole training set (24,558 sentences)
used for training.
We submitted one run of the system (with iden-
tification of both element boundaries and element
names). Our official scores are: precision 86.9%,
recall 75.2% and overlap 84.7%. Our own evalua-
tion of the submitted run with the strict measures,
i.e., an element is considered correct only if both its
name and boundaries match the gold standard, gave
precision 73.5% and recall 63.6%.
4 Logic Forms
4.1 Method
For the LF task, it was straightforward to turn de-
pendency structures into LFs. Since the LF for-
malism does not attempt to represent the more sub-
tle aspects of semantics, such as quantification, in-
tensionality, modality, or temporality (Rus, 2002),
the primary information encoded in an LF is based
on argument structure, which is already well cap-
tured by the dependency parses. Our LF genera-
tor traverses the dependency structure, turning POS-
tagged lexical items into LF predicates, creating ref-
erential variables for nouns and verbs, and using
dependency labels to order the arguments for each
predicate. We make one change to the dependency
graphs originally produced by the parser. Instead of
taking coordinators, such as and, to modify the con-
stituents they coordinate, we take the coordinated
constituents to be arguments of the coordinator.
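For example (a constructed case), in directors and managers the parser-derived dependency graph would have and modifying the coordinated noun phrase; after our change, and heads the construction, with directors and managers as its two dependents, so that the coordinator can surface in the LF as a predicate over the conjuncts’ variables.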
Our LF generator builds a labeled directed graph
from a dependency structure and traverses this
graph depth-first. In general, a well-formed depen-
dency graph has exactly one root node, which cor-
responds to the main verb of the sentence. Sen-
tences with multiple independent clauses may have
one root per clause. The generator begins traversing
the graph at one of these root nodes; if there is more
than one, it completes traversal of the subgraph con-
nected to the first node before going on to the next
node.
The first step in processing a node—producing an
LF predicate from the node’s lexical item—is taken
care of in the graph-building stage. We use a base
form dictionary to get the base form of the lexical
item and a simple mapping of Penn Treebank tags
into ‘n’, ‘v’, ‘a’, and ‘r’ to get the suffix. For words
that are not tagged as nouns, verbs, adjectives, or
adverbs, the LF predicate is simply the word itself.
As the graph is traversed, the processing of a node
depends on its type. The greatest amount of pro-
cessing is required for a node corresponding to a
verb. First, a fresh referential variable is generated
as the event argument of the verbal predication. The
out-edges are then searched for nodes to process.
Since the order of arguments in an LF predication
is important and some sentence constituents are ig-
nored for the purposes of LF, the out-edges are cho-
sen in order by label: first particles (‘VP|PRT’),
then arguments (‘S|NP-SBJ’, ‘VP|NP’, etc.), and
finally adjuncts. We attempt to follow the argu-
ment order implicit in the description of LF given
in (Rus, 2002), and as the formalism requires, we
ignore auxiliary verbs and negation. The processing
of each of these arguments or adjuncts is handled re-
cursively and returns a set of predications. For mod-
ifiers, the event variable also has to be passed down.
For referential arguments and adjuncts, a referen-
tial variable is also returned to serve as an argument
for the verb’s LF predicate. Once all the arguments
and adjuncts have been processed, a new predica-
tion is generated, in which the verb’s LF predicate
is applied to the event variable and the recursively
generated referential variables. This new predica-
tion, along with the recursively generated ones, is
returned.
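For the running example of Figure 1, this procedure would produce an LF along the following lines (one plausible output; the adjunct this month is omitted for brevity):

  director:n(x1) plan:v(e1, x1, e2) seek:v(e2, x1, x3) seat:n(x3)

The non-local subject-control dependency recovered in Section 2 is what lets x1 reappear as the subject argument of seek.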
The processing of a nominal node proceeds sim-
ilarly. A fresh referential variable is generated—
since determiners are ignored in the LF formalism,
it is simply assumed that all noun phrases corre-
spond to a (possibly composite) individual. Out-
edges are examined for modifiers and recursively
processed. Both the referential variable and the set
of new predications are returned. Noun compounds
introduce some additional complexity; each modi-
fying noun introduces two additional variables, one
for the modifying noun and one for the composite indi-
vidual realizing the compound. This latter variable
then replaces the referential variable for the head
noun.
Processing of other types of nodes proceeds in a
similar fashion. For modifiers such as adjectives,
adverbs, and prepositional phrases, a variable (cor-
responding to the individual or event being modi-
fied) is passed in, and the LF predicate of the node
is applied to this variable, rather than to a fresh
variable. In the case of prepositional phrases, the
predicate is applied to this variable and to the vari-
able corresponding to the object of the preposition,
which must be processed, as well. The latter vari-
able is then returned along with the new predica-
tions. For other modifiers, just the predications are
returned.
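For example (a constructed illustration), the old man in the park slept would come out as

  man:n(x1) old:a(x1) in(x1, x2) park:n(x2) sleep:v(e1, x1)

with the adjective and preposition applied to the variable passed in for man, the preposition taking a second argument for its object, and the determiners dropped.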
4.2 Development and results
The rules for handling dependency labels were writ-
ten by hand. Of the roughly 1100 dependency la-
bels that the parser assigns (see Section 2), our sys-
tem handles 45 labels, all of which fall within the
most frequent 135 labels. About 50 of these 135
labels are dependencies that can be ignored in the
generation of LFs (labels involving punctuation, de-
terminers, auxiliary verbs, etc.); of the remaining
85 labels, the 45 labels handled were chosen to pro-
vide reasonable coverage over the sample corpus
provided by the task organizers. Extending the sys-
tem is straightforward; to handle a dependency label
linking two node types, a rule matching the label
and invoking the dependent node handler is added
to the head node handler.
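The dispatch can be pictured as follows (a sketch with hypothetical handler names and node interface; the real rule set covers 45 labels):

  # Rule table: dependency label -> handler for the dependent node.
  # Extending coverage to a new label is one new entry here.
  RULES = {
      "S|NP-SBJ": "handle_np",        # clausal subject
      "VP|NP":    "handle_np",        # direct object
      "VP|PP":    "handle_pp",        # prepositional modifier
      "VP|PRT":   "handle_particle",  # verb particle
  }

  def process_node(node, handlers):
      # node.out_edges yields (label, dependent) pairs -- our interface.
      for label, dependent in node.out_edges:
          name = RULES.get(label)
          if name is not None:
              handlers[name](dependent)
          # labels not in the table (punctuation, determiners,
          # auxiliaries, ...) are ignored, as the LF formalism requires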
On the sample corpus of 50 sentences to which our system was tuned, predicate identification (evaluated against the provided LFs, with POS tags included) achieved 89.1% precision and 87.1% recall.
Argument identification was performed with 78.9%
precision and 77.4% recall. On the test corpus of
300 sentences, our official results, which exclude
POS-tags, were 82.0% precision and 78.4% recall
for predicate identification and 73.0% precision and
69.1% recall for argument identification.
We did not get the gold standard for the test cor-
pus in time to perform error analysis for our official
submission, but we did examine the errors in the
LFs we generated for the trial corpus. Most could
be traced to errors in the dependency parses, which
is unsurprising, since the generation of LFs from de-
pendency parses is relatively straightforward. A few
errors resulted from the fact that our system does not
try to identify multi-word compounds.
Some discrepancies between our LFs and the LFs
provided for the trial corpus arose from apparent
inconsistencies in the provided LFs. Verbs with
particles were a particular problem. Sometimes,
as in sentences 12 and 13 of the trial corpus, a
verb-particle combination such as look forward to
is treated as a single predicate (look forward to); in
other cases, such as in sentence 35, the verb and its
particle (go out) are treated as separate predicates.
Other inconsistencies in the provided LFs include
missing arguments (direct object in sentence 24),
and verbs not reduced to base form (felt, saw, and
found in sentences 34, 48, 50).
5 Conclusions
Our main finding during the development of the sys-
tems for the two Senseval tasks was that semantic
relations are indeed very close to syntactic depen-
dencies. Using deep dependency structures helped
to keep the manual rules for the LF task simple
and made the learning for the SR task easier. We also found that memory-based learning can be effi-
ciently applied to complex, highly structured prob-
lems such as the identification of semantic roles.
Our future work includes more accurate fine-
tuning of the learner for the SR task, extending the
coverage of the LF generator, and experimenting
with the generated LFs for question answering.
6 Acknowledgments
Ahn and De Rijke were supported by a grant from
the Netherlands Organization for Scientific Re-
search (NWO) under project number 612.066.302.
Fissaha, Jijkoun, and De Rijke were supported by a
grant from NWO under project number 220-80-001.
De Rijke was also supported by grants from NWO,
under project numbers 365-20-005, 612.069.006,
612.000.106, and 612.000.207.

References
M. Collins. 1999. Head-Driven Statistical Models
for Natural Language Parsing. Ph.D. thesis, Uni-
versity of Pennsylvania.
W. Daelemans, J. Zavrel, K. van der Sloot, and
A. van den Bosch. 2003. TiMBL: Tilburg Mem-
ory Based Learner, version 5.0, Reference Guide.
ILK Technical Report 03-10. Available from
http://ilk.kub.nl/downloads/pub/papers/ilk0310.pdf.
C. J. Fillmore. 1977. The need for a frame semantics in
linguistics. Statistical Methods in Linguistics, 12:5–
29.
D. Gildea and D. Jurafsky. 2002. Automatic label-
ing of semantic roles. Computational Linguistics,
28(3):245–288.
V. Jijkoun and M. de Rijke. 2004. Enriching the output
of a parser using memory-based learning. In Proceed-
ings of ACL 2004.
C. Johnson, M. Petruck, C. Baker, M. Ellsworth, J. Rup-
penhofer, and C. Fillmore. 2003. FrameNet: Theory
and Practice. http://www.icsi.berkeley.edu/~framenet.
A. Ratnaparkhi. 1996. A maximum entropy part-of-
speech tagger. In Proceedings of the Empirical Meth-
ods in Natural Language Processing Conference.
V. Rus. 2002. Logic Form for WordNet Glosses. Ph.D.
thesis, Southern Methodist University.
