Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 193–196, Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Semantic Role Labeling as Sequential Tagging
Llu´ıs M`arquez, Pere Comas, Jes´us Gim´enez and Neus Catal`a
TALP Research Centre
Technical University of Catalonia (UPC)
{lluism,pcomas,jgimenez,ncatala}@lsi.upc.edu
Abstract
In this paper we present a semantic role
labeling system submitted to the CoNLL-
2005 shared task. The system makes
use of partial and full syntactic informa-
tion and converts the task into a sequen-
tial BIO-tagging. As a result, the label-
ing architecture is very simple . Build-
ing on a state-of-the-art set of features, a
binary classifier for each label is trained
using AdaBoost with fixed depth decision
trees. The final system, which combines
the outputs of two base systems performed
F1=76.59 on the official test set. Addi-
tionally, we provide results comparing the
system when using partial vs. full parsing
input information.
1 Goals and System Architecture
The goal of our work is twofold. On the one hand,
we want to test whether it is possible to implement
a competitive SRL system by reducing the task to a
sequential tagging. On the other hand, we want to
investigate the effect of replacing partial parsing in-
formation by full parsing. For that, we built two dif-
ferent individual systems with a shared sequential
strategy but using UPC chunks-clauses, and Char-
niak’s parses, respectively. We will refer to those
systems as PPUPC and FPCHA, hereinafter.
Both partial and full parsing annotations provided
as input information are of hierarchical nature. Our
system navigates through these syntactic structures
in order to select a subset of constituents organized
sequentially (i.e., non embedding). Propositions are
treated independently, that is, each target verb gen-
erates a sequence of tokens to be annotated. We call
this pre-processing step sequentialization.
The sequential tokens are selected by exploring
the sentence spans or regions defined by the clause
boundaries1. The top-most syntactic constituents
falling inside these regions are selected as tokens.
Note that this strategy is independent of the input
syntactic annotation explored, provided it contains
clause boundaries. It happens that, in the case of
full parses, this node selection strategy is equivalent
to the pruning process defined by Xue and Palmer
(2004), which selects sibling nodes along the path of
ancestors from the verb predicate to the root of the
tree2. Due to this pruning stage, the upper-bound re-
call figures are 95.67% for PPUPC and 90.32% for
FPCHA. These values give F1 performance upper
bounds of 97.79 and 94.91, respectively, assuming
perfect predictors (100% precision).
The nodes selected are labeled with B-I-O tags
depending if they are at the beginning, inside, or out-
side of a verb argument. There is a total of 37 argu-
ment types, which amount to 37*2+1=75 labels.
Regarding the learning algorithm, we used gen-
eralized AdaBoost with real-valued weak classifiers,
which constructs an ensemble of decision trees of
fixed depth (Schapire and Singer, 1999). We con-
sidered a one-vs-all decomposition into binary prob-
1Regions to the right of the target verb corresponding to an-
cestor clauses are omitted in the case of partial parsing.
2With the unique exception of the exploration inside sibling
PP constituents proposed by (Xue and Palmer, 2004).
193
lems to address multi-class classification.
AdaBoost binary classifiers are used for labeling
test sequences in a left-to-right tagging scheme us-
ing a recurrent sliding window approach with infor-
mation about the tag assigned to the preceding to-
ken. This tagging module ensures some basic con-
straints, e.g., BIO correct structure, arguments do
not cross clause boundaries nor base chunk bound-
aries, A0-A5 arguments not present in PropBank
frames for a certain verb are not allowed, etc. We
also tried beam search on top of the classifiers’ pre-
dictions to find the sequence of labels with highest
sentence-level probability (as a summation of indi-
vidual predictions). But the results did not improve
the basic greedy tagging.
Regarding feature representation, we used all
input information sources, with the exception of
verb senses and Collins’ parser. We did not con-
tribute with significantly original features. Instead,
we borrowed most of them from the existing liter-
ature (Gildea and Jurafsky, 2002; Carreras et al.,
2004; Xue and Palmer, 2004). Broadly speaking, we
considered features belonging to four categories3:
(1) On the verb predicate:
• Form; Lemma; POS tag; Chunk type and Type of
verb phrase in which verb is included: single-word or
multi-word; Verb voice: active, passive, copulative, in-
finitive, or progressive; Binary flag indicating if the verb
is a start/end of a clause.
• Subcategorization, i.e., the phrase structure rule expand-
ing the verb parent node.
(2) On the focus constituent:
• Type; Head: extracted using common head-word rules;
if the first element is a PP chunk, then the head of the first
NP is extracted;
• First and last words and POS tags of the constituent.
• POS sequence: if it is less than 5 tags long; 2/3/4-grams
of the POS sequence.
• Bag-of-words of nouns, adjectives, and adverbs in the
constituent.
• TOP sequence: sequence of types of the top-most syn-
tactic elements in the constituent (if it is less than 5 ele-
ments long); in the case of full parsing this corresponds to
the right-hand side of the rule expanding the constituent
node; 2/3/4-grams of the TOP sequence.
• Governing category as described in (Gildea and Juraf-
sky, 2002).
3Features extracted from partial parsing and Named Enti-
ties are common to PPUPC and FPCHA models, while features
coming from Charniak parse trees are implemented exclusively
in the FPCHA model.
• NamedEnt, indicating if the constituent embeds or
strictly-matches a named entity along with its type.
• TMP, indicating if the constituent embeds or strictly
matches a temporal keyword (extracted from AM-TMP ar-
guments of the training set).
(3) Context of the focus constituent:
• Previous and following words and POS tags of the con-
stituent.
• The same features characterizing focus constituents are
extracted for the two previous and following tokens,
provided they are inside the clause boundaries of the cod-
ified region.
(4) Relation between predicate and constituent:
• Relative position; Distance in words and chunks; Level
of embedding with respect to the constituent: in number
of clauses.
• Constituent path as described in (Gildea and Jurafsky,
2002); All 3/4/5-grams of path constituents beginning at
the verb predicate or ending at the constituent.
• Partial parsing path as described in (Carreras et al.,
2004); All 3/4/5-grams of path elements beginning at the
verb predicate or ending at the constituent.
• Syntactic frame as described by Xue and Palmer (2004)
2 Experimental Setting and Results
We trained the classification models using the com-
plete training set (sections from 02 to 21). Once con-
verted into one sequence per target predicate, the re-
sulting set amounts 1,049,049 training examples in
the PPUPC model and 828,811 training examples in
the FPCHA model. The average number of labels per
argument is 2.071 and 1.068, respectively. This fact
makes “I” labels very rare in the FPCHA model.
When running AdaBoost, we selected as weak
rules decision trees of fixed depth 4 (i.e., each branch
may represent a conjunction of at most 4 basic fea-
tures) and trained a classification model per label for
up to 2,000 rounds.
We applied some simplifications to keep training
times and memory requirements inside admissible
bounds. First, we discarded all the argument la-
bels that occur very infrequently and trained only
the 41 most frequent labels in the case of PPUPC
and the 35 most frequent in the case of FPCHA.
The remaining labels where joined in a new label
“other” in training and converted into “O” when-
ever the SRL system assigns a “other” label dur-
ing testing. Second, we performed a simple fre-
quency filtering by discarding those features occur-
ring less than 15 times in the training set. As an
194
exception, the frequency threshold for the features
referring to the verb predicate was set to 3. The final
number of features we worked with is 105,175 in the
case of PPUPC and 80,742 in the case of FPCHA.
Training with these very large data and feature
sets becomes an issue. Fortunately, we could split
the computation among six machines in a Linux
cluster. Using our current implementation combin-
ing Perl and C++ we could train the complete mod-
els in about 2 days using memory requirements be-
tween 1.5GB and 2GB. Testing with the ensembles
of 2,000 decision trees per label is also not very effi-
cient, though the resulting speed is admissible, e.g.,
the development set is tagged in about 30 minutes
using a standard PC.
The overall results obtained by our individual
PPUPC and FPCHA SRL systems are presented in ta-
ble 1, with the best results in boldface. As expected,
the FPCHA system significantly outperformed the
PPUPC system, though the results of the later can
be considered competitive. This fact is against the
belief, expressed as one of the conclusions of the
CoNLL-2004 shared task, that full-parsing systems
are about 10 F1 points over partial-parsing systems.
In this case, we obtain a performance difference of
2.18 points in favor of FPCHA.
Apart from resulting performance, there are addi-
tional advantages when using the FPCHA approach.
Due to the coarser granularity of sequence tokens,
FPCHA sequences are shorter. There are 21% less
training examples and a much lower quantity of “I”
tags to predict (the mapping between syntactic con-
stituents and arguments is mostly one-to-one). As
a consequence, FPCHA classifiers train faster with
less memory requirements, and achieve competitive
results (near the optimal) with much less rounds of
boosting. See figure 1. Also related to the token
granularity, the number of completely correct out-
puts is 4.13 points higher in FPCHA, showing that
the resulting labelings are structurally better than
those of PPUPC.
Interestingly, the PPUPC and FPCHA systems
make quite different argument predictions. For in-
stance, FPCHA is better at recognizing A0 and A1
arguments since parse constituents corresponding to
these arguments tend to be mostly correct. Compar-
atively, PPUPC is better at recognizing A2-A4 argu-
ments since they are further from the verb predicate
 64
 66
 68
 70
 72
 74
 76
 78
 200  400  600  800  1000  1200  1400  1600  1800  2000
Overall F1
Number of rounds
PP-upc
FP-cha
PP best
FP-cha best
Figure 1: Overall F1 performance of individual sys-
tems on the development set with respect to the num-
ber of learning rounds
Perfect props Precision Recall Fβ=1
PPUPC 47.38% 76.86% 70.55% 73.57
FPCHA 51.51% 78.08% 73.54% 75.75
Combined 51.39% 78.39% 75.53% 76.93
Table 1: Overall results of the individual systems on
the development set.
and tend to accumulate more parsing errors, while
the fine granularity of the PPUPC sequences still al-
low to capture them4. Another interesting observa-
tion is that the precision of both systems is much
higher than the recall.
The previous two facts suggest that combining the
outputs of the two systems may lead to a significant
improvement. We experimented with a greedy com-
bination scheme for joining the maximum number of
arguments from both solutions in order to increase
coverage and, hopefully, recall. It proceeds depart-
ing from an empty solution by: First, adding all the
arguments from FPCHA in which this method per-
forms best; Second, adding all the arguments from
PPUPC in which this method performs best; and
Third, making another loop through the two meth-
ods adding the arguments not considered in the first
loop. At each step, we require that the added argu-
ments do not overlap/embed with arguments in the
current solution and also that they do not introduce
repetitions of A0-A5 arguments. The results on the
4As an example, the F1 performance of PPUPC on A0 and
A2 arguments is 79.79 and 65.10, respectively. The perfor-
mance of FPCHA on the same arguments is 84.03 and 62.36.
195
Precision Recall Fβ=1
Development 78.39% 75.53% 76.93
Test WSJ 79.55% 76.45% 77.97
Test Brown 70.79% 64.35% 67.42
Test WSJ+Brown 78.44% 74.83% 76.59
Test WSJ Precision Recall Fβ=1
Overall 79.55% 76.45% 77.97
A0 87.11% 86.28% 86.69
A1 79.60% 76.72% 78.13
A2 69.18% 67.75% 68.46
A3 76.38% 56.07% 64.67
A4 79.78% 69.61% 74.35
A5 0.00% 0.00% 0.00
AM-ADV 59.15% 52.37% 55.56
AM-CAU 73.68% 57.53% 64.62
AM-DIR 71.43% 35.29% 47.24
AM-DIS 77.14% 75.94% 76.54
AM-EXT 63.64% 43.75% 51.85
AM-LOC 62.74% 54.27% 58.20
AM-MNR 54.33% 52.91% 53.61
AM-MOD 96.16% 95.46% 95.81
AM-NEG 99.13% 98.70% 98.91
AM-PNC 53.49% 40.00% 45.77
AM-PRD 0.00% 0.00% 0.00
AM-REC 0.00% 0.00% 0.00
AM-TMP 77.68% 78.75% 78.21
R-A0 86.84% 88.39% 87.61
R-A1 75.32% 76.28% 75.80
R-A2 54.55% 37.50% 44.44
R-A3 0.00% 0.00% 0.00
R-A4 0.00% 0.00% 0.00
R-AM-ADV 0.00% 0.00% 0.00
R-AM-CAU 0.00% 0.00% 0.00
R-AM-EXT 0.00% 0.00% 0.00
R-AM-LOC 0.00% 0.00% 0.00
R-AM-MNR 0.00% 0.00% 0.00
R-AM-TMP 69.81% 71.15% 70.48
V 99.16% 99.16% 99.16
Table 2: Overall results (top) and detailed results on
the WSJ test (bottom).
development set (presented in table 1) confirm our
expectations, since a performance increase of 1.18
points over the best individual system was observed,
mainly caused by recall improvement. The final sys-
tem we presented at the shared task performs exactly
this solution merging procedure. When applied on
the WSJ test set, the combination scheme seems to
generalize well, since an improvement is observed
with respect to the development set. See the offi-
cial results of our system, which are presented in ta-
ble 2. Also from that table, it is worth noting that the
F1 performance drops by more than 9 points when
tested on the Brown test set, indicating that the re-
sults obtained on the WSJ corpora do not generalize
well to corpora with other genres. The study of the
sources of this lower performance deserves further
investigation, though we do not believe that it is at-
tributable to the greedy combination scheme.
3 Conclusions
We have presented a simple SRL system submit-
ted to the CoNLL-2005 shared task, which treats
the SRL problem as a sequence tagging task (us-
ing a BIO tagging scheme). Given the simplic-
ity of the approach, we believe that the results are
very good and competitive compared to the state-
of-the-art. We also provided a comparison between
two SRL systems sharing the same architecture, but
build on partial vs. full parsing, respectively. Al-
though the full parsing approach obtains better re-
sults and has some implementation advantages, the
partial parsing system shows also a quite competi-
tive performance. The results on the development
set differ in 2.18 points, but the outputs generated
by the two systems are significantly different. The
final system, which scored F1=76.59 in the official
test set, is a combination of both individual systems
aiming at increasing coverage and recall.
Acknowledgements
This research has been partially supported by the
European Commission (CHIL project, IP-506909).
Jes´us Gim´enez is a research fellow from the Span-
ish Ministry of Science and Technology (ALIADO
project, TIC2002-04447-C02). We would like to
thank also Xavier Carreras for providing us with
many software components and Mihai Surdeanu for
fruitful discussions on the problem and feature engi-
neering.

References
X. Carreras, L. M`arquez, and G. Chrupała. 2004. Hierarchical
recognition of propositional arguments with perceptrons. In
Proceedings of CoNLL-2004.
D. Gildea and D. Jurafsky. 2002. Automatic labeling of seman-
tic roles. Computational Linguistics, 28(3):245–288.
R. E. Schapire and Y. Singer. 1999. Improved Boosting Algo-
rithms Using Confidence-rated Predictions. Machine Learn-
ing, 37(3).
N. Xue and M. Palmer. 2004. Calibrating features for semantic
role labeling. In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP).
