Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 811–818,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Comparison of Alternative Parse Tre Paths 
for Labeling Semantic Roles 
 
Reid Swanson and Andrew S. Gordon 
Institute for Creative Technologies 
University of Southern California 
13274 Fiji Way, Marina del Rey, CA 90292 USA 
swansonr@ict.usc.edu, gordon@ict.usc.edu 
 
  
 
Abstract 
The integration of sophisticated infer-
ence-based techniques into natural lan-
guage processing applications first re-
quires a reliable method of encoding the 
predicate-argument structure of the pro-
positional content of text. Recent statisti-
cal approaches to automated predicate-
argument annotation have utilized parse 
tree paths as predictive features, which 
encode the path between a verb predicate 
and a node in the parse tree that governs 
its argument. In this paper, we explore a 
number of alternatives for how these 
parse tree paths are encoded, focusing on 
the difference between automatically 
generated constituency parses and de-
pendency parses. After describing five al-
ternatives for encoding parse tree paths, 
we investigate how well each can be 
aligned with the argument substrings in 
annotated text corpora, their relative pre-
cision and recall performance, and their 
comparative learning curves. Results in-
dicate that constituency parsers produce 
parse tree paths that can more easily be 
aligned to argument substrings, perform 
better in precision and recall, and have 
more favorable learning curves than 
those produced by a dependency parser. 
1 Introduction 
A persistent goal of natural language processing 
research has been the automated transformation 
of natural language texts into representations that 
unambiguously encode their propositional 
content in formal notation. Increasingly, first-
order predicate calculus representations of 
textual meaning have been used in natural 
lanugage processing applications that involve 
automated inference. For example, Moldovan et 
al. (203) demonstrate how predicate-argument 
formulations of questions and candidate answer 
sentences are unified using logical inference in a 
top-performing question-answering application. 
The importance of robust techniques for 
predicate-argument transformation has motivated 
the development of large-scale text corpora with 
predicate-argument annotations such as 
PropBank (Palmer et al., 205) and FrameNet 
(Baker et al., 198). These corpora typically take 
a pragmatic approach to the predicate-argument 
representations of sentences, where predicates 
correspond to single word trigers in the surface 
form of the sentence (typically verb lemmas), 
and arguments can be identified as substrings of 
the sentence. 
Along with the development of annotated 
corpora, researchers have developed new 
techniques for automatically identifying the 
arguments of predications by labeling text 
segments in sentences with semantic roles. Both 
Gildea & Jurafsky (202) and Palmer et al. 
(205) describe statistical labeling algorithms 
that achieve high accuracy in assigning semantic 
role labels to appropropriate constituents in a 
parse tree of a sentence. Each of these efforts 
employed the use of parse tree paths as 
predictive features, encoding the series of up and 
down transitions through a parse tree to move 
from the node of the verb (predicate) to the 
governing node of the constituent (argument). 
Palmer et al. (205) demonstrate that utilizing 
the gold-standard parse trees of the Penn tree-
bank (Marcus et al., 193) to encode parse tree 
paths yields significantly better labeling accuracy 
than when using an automatic syntactical parser, 
namely that of Colins (199). 
811
Parse tree paths (between verbs and arguments 
that fil semantic roles) are particularly interest-
ing because they symbolically encode the rela-
tionship between the syntactic and semantic as-
pects of verbs, and are potentially generalized 
acros other verbs within the same class (Levin, 
193). However, the encoding of individual 
parse tree paths for predicates is wholy depend-
ent on the characteristics of the parse tree of a 
sentence, for which competing approaches could 
be taken. 
The research effort described in this paper fur-
ther explores the role of parse tree paths in iden-
tifying the argument structure of verb-based 
predications. We are particularly interested in 
exploring alternatives to the constituency parses 
that were used in previous research, including 
parsing approaches that employ dependency 
grammars. Specifically, our aim is to answer four 
important questions: 
1. How can parse tree paths be encoded when 
employing different automated constituency 
parsers, i.e. Charniak (200), Klein & Maning 
(203), or a dependency parser (Lin, 198)? 
2. Given that each of these alternatives creates 
a different formulation of the parse tree of a sen-
tence, which of them encodes branches that are 
easiest to align with substrings that have been 
annotated with semantic role information? 
3. What is the relative precision and recall per-
formance of parse tree paths formulated using 
these alternative automated parsing techniques, 
and do the results vary depending on argument 
type? 
4. How many examples of parse tree paths are 
necessary to provide as training examples in or-
der to achieve high labeling accuracy when em-
ploying each of these parsing alternatives? 
Each of these four questions is addressed in 
the four subsequent sections of this paper, fol-
lowed by a discusion of the implications of our 
findings and directions for future work. 
2 Alternative Parse Tre Paths 
Parse tree paths were introduced by Gildea & 
Jurafsky (202) as descriptive features of the 
syntactic relationship between predicates and 
arguments in the parse tree of a sentence. Predi-
cates are typically assumed to be specific target 
words (usually verbs), and arguments are as-
sumed to be a span of words in the sentence that 
are governed by a single node in the parse tree. A 
parse tree path can be described as a sequence of 
transitions up and down a parse tree from the 
target word to the governing node, as exempli-
fied in Figure 1. 
The encoding of the parse tree path feature is 
dependent on the syntactic representation that is 
produced by the parser. This, in turn, is depend-
ant on the training corpus used to build the 
parser, and the conditioning factors in its prob-
ability model. As result, encodings of parse tree 
paths can vary greatly depending on the parser 
that is used, yielding parse tree paths that vary in 
their ability to generalize acros sentences. 
In this paper we explore the characteristics of 
parse tree paths with respect to different ap-
proaches to automated parsing. We were particu-
larly interested in comparing traditional constitu-
ency parsing (as exemplified in Figure 1) with 
dependency parsing, specifically the Minipar 
system built by Lin (198). Minipar is increas-
ingly being used in semantics-based nlp applica-
tions (e.g. Pantel & Lin, 202). Dependency 
parse trees differ from constituency parses in that 
they represent sentence structures as a set of de-
pendency relationships between words, typed 
asymmetric binary relationships between head 
words and modifying words. Figure 2 depicts the 
output of Minipar on an example sentence, where 
each node is a word or an empty node along with 
the word lemma, its part of speech, and the 
relationship type to its governing node. 
Our motivation for exploring the use of Mini-
par in for the creation of parse tree paths can be 
seen by comparing Figure 1 and Figure 2, where 
 
Figure 1: An example parse tree path from 
the predicate ate to the argument NP He, rep-
resented as VB↑VP↑S↓NP. 
 
 
 
Figure 2. An example dependency parse, 
with a parse tree path from the predicate ate 
to the argument He. 
812
the Minipar path is both shorter and simpler for 
the same predicate-argument relationship, and 
could be encoded in various ways that take ad-
vantage of the additional semantic and lexical 
information that is provided. 
To compare traditional constituency parsing 
with dependency parsing, we evaluated the accu-
racy of argument labeling using parse tree paths 
generated by two leading constituency parsers 
and three variations of parse tree paths generated 
by Minipar, as folows: 
 
Charniak: We used the Charniak parser 
(200) to extract parse tree paths similar to those 
found in Palmer et al. (205), with some slight 
modifications. In cases where the last node in the 
path was a non-branching pre-terminal, we added 
the lexical information to the path node. In addi-
tion, our paths led to the lowest governing node, 
rather than the highest. For example, the parse 
tree path for the argument in Figure 1 would be 
encoded as: 
VB↑VP↑S↓NP↓PRP:he 
 
Stanford: We also used the Stanford parser 
developed by Klein & Manning (203), with the 
same path encoding as the Charniak parser. 
 
Minipar A: We used three variations of parse 
tree path encodings based on Lin’s dependency 
parser, Minipar (198). Minipar A is the first and 
most restrictive path encoding, where each is 
annotated with the entire information output by 
Minpar at each node. A typical path might be: 
ate:eat,V,i↓He:he,N,s 
 
Minipar B: A second parse tree path encoding 
was generated from Minipar parses that relaxes 
some of the constraints used in Minpar A. In-
stead of using all the information contained at a 
node, in Minipar B we only encode a path with 
its part of speech and relational information. For 
example: 
V,i↓N,s 
 
Minipar C: As the converse to Minipar A we 
also tried one other Minipar encoding. As in 
Minipar A, we annotated the path with all the 
information output, but instead of doing a direct 
string comparison during our search, we consid-
ered two paths matching when there was a match 
between either the word, the stem, the part of 
speech, or the relation. For example, the folow-
ing two parse tree paths would be considered a 
match, as both include the relation i. 
ate:eat,V,i↓He:he,N,s 
was:be,VBE,i↓He:he,N,s 
 
We explored other combinations of depend-
ency relation information for Minipar-derived 
parse tree paths, including the use of the deep 
relations. However, results obtained using these 
other combinations were not notably different 
from those of the three base cases listed above, 
and are not included in the evaluation results re-
ported in this paper. 
3 Aligning arguments to parse tres 
nodes in a training / testing corpus 
We began our investigation by creating a training 
and testing corpus of 40 sentences each contain-
ing an inflection of one of four target verbs (10 
each), namely believe, think, give, and receive. 
These sentences were selected at random from 
the 194-07 section of the New York Times gi-
gaword corpus from the Linguistic Data Consor-
tium. These four verbs were chosen because of 
the synonymy among the first two, and the re-
flexivity of the second two, and because all four 
have straightforward argument structures when 
viewed as predicates, as folows: 
 
predicate: believe 
arg0: the believer 
arg1: the thing that is believed 
 
predicate: think 
arg0: the thinker 
arg1: the thing that is thought 
 
predicate: give 
arg0: the giver 
arg1: the thing that is given 
arg2: the receiver 
 
predicate: receive 
arg0: the receiver 
arg1: the thing that is received 
arg2: the giver 
 
This corpus of sentences was then annotated 
with semantic role information by the authors of 
this paper. Al annotations were made by assign-
ing start and stop locations for each argument in 
the unparsed text of the sentence. After an initial 
pilot annotation study, the folowing annotation 
policy was adopted to overcome common dis-
agreements: (1) When the argument is a noun 
and it is part of a definite description then in-
813
clude the entire definite description. (2) Do not 
include complementizers such as ‘that’ in ‘be-
lieve that’ in an argument. (3) Do include prepo-
sitions such as ‘in’ in ‘believe in’. (4) When in 
doubt, assume phrases attach locally. Using this 
policy, an agreement of 92.8% was achieved 
among annotators for the set of start and stop 
locations for arguments. Examples of semantic 
role annotations in our corpus for each of the 
four predicates are as folows:  
1. [
Arg0
Those who excavated the site in 1907] 
believe [
Arg1
 it once stod two or three stories 
high.] 
2. Gus is in god shape and [
Arg0
 I] think [
Arg1
 
he's happy as a bear.] 
3. If successful, [
Arg0
 he] wil give [
Arg1
 the 
funds] to [
Arg2
 his Vietnamese family.] 
4. [
Arg0
 The Bosnian Serbs] have received [
Arg1
 
military and economic suport] from [
Arg2
 Ser-
bia.] 
The next step was to parse the corpus of 40 
sentences using each of three automated parsing 
systems (Charniak, Stanford, and Minipar), and 
align each of the annotated arguments with its 
closest matching branch in the resulting parse 
trees. Given the differences in the parsing models 
used by these three systems, each yield parse tree 
nodes that govern different spans of text in the 
sentence. Often there exists no parse tree node 
that governs a span of text that exactly matches 
the span of an argument in the annotated corpus. 
Accordingly, it was necessary to identify the 
closest match posible for each of the three pars-
ing systems in order to encode parse tree paths 
for each. We developed a uniform policy that 
would facilitate a fair comparison between pars-
ing techniques. Our approach was to identify a 
single node in a given parse tree that governed a 
string of text with the most overlap with the text 
of the annotated argument. Each of the parsing 
methods tokenizes the input string differently, so 
in order to simplify the selection of the govern-
ing node with the most overlap, we made this 
selection based on lowest minimum edit distance 
(Levenshtein distance). 
Al three of these different parsing algorithms 
produced single governing nodes that overlapped 
well with the human-annotated corpus. However, 
it appeared that the two constituency parsers pro-
duced governing nodes that were more closely 
aligned, based on minimum edit distance. The 
Charniak parser aligned best with the annotated 
text, with an average of 2.40 characters for the 
lowest minimum edit distance (standard de-
viation = 8.64). The Stanford parser performed 
slightly worse (average = 2.67, standard devia-
tion = 8.86), while distances were nearly two 
times larger for Minipar (average = 4.73, 
standard deviation = 10.4). 
In each case, the most overlapping parse tree 
node was treated as correct for training and test-
ing purposes. 
4 Comparative Performance Evaluation 
In order to evaluate the comparative performance 
of the parse tree paths for each of the five encod-
ings, we divided the corpus in to equal-sized 
training and test sets (50 training and 50 test ex-
amples for each of the four predicates). We then 
constructed a system that identified the parse tree 
paths for each of the 10 arguments in the training 
sets, and applied them to the sentences in each 
corresponding test sets. When applying the 50 
training parse tree paths to any one of the 50 test 
sentences for a given predicate-argument pair, a 
set of zero or more candidate answer nodes were 
returned. For the purpose of calculating precision 
and recall scores, credit was given when the cor-
rect answer appeared in this set. Precision scores 
were calculated as the number of correct answers 
found divided by the number of all candidate 
answer nodes returned. Recall scores were calcu-
lated as the number of correct answers found di-
vided by the total number of correct answers 
posible. F-scores were calculated as the equally-
weighted harmonic mean of precision and recall. 
Our calculation of recall scores represents the 
best-posible performance of systems using only 
these types of parse-tree paths. This level of per-
formance could be obtained if a system could 
always select the correct answer from the set of 
candidates returned. However, it is also informa-
tive to estimate the performance that could be 
achieved by randomly selecting among the can-
didate answers, representing a lower-bound on 
performance. Accordingly, we computed an ad-
justed recall score that awarded only fractional 
credit in cases where more than one candidate 
answer was returned (one divided by the set 
size). Adjusted recall is the sum of all of these 
adjusted credits divided by the total number of 
correct answers posible. 
Figure 3 summarizes the comparative recall, 
precision, f-score, and adjusted recall perform-
ance for each of the five parse tree path formula-
tions. The Charniak parser achieved the highest 
overall scores (precision=.49, recall=.68, f-
score=.57, adjusted recall=.48), folowed closely 
814
by the Stanford parser (precision=.47, recall=.67, 
f-score=.5, adjusted recall=.48). 
Our expectation was that the short, semanti-
cally descriptive parse tree paths produced by 
Minipar would yield the highest performance. 
However, these results indicate the oposite; the 
constituency parsers produce the most accurate 
parse tree paths. Only Minipar C offers beter 
recall (0.71) than the constituency parsers, but at 
the expense of extremely low precision. Minipar 
A offers excellent precision (0.62), but with ex-
tremely low recall. Minipar B provides a balance 
between recall and precision performance, but 
falls short of being competitive with the parse 
tree paths generated by the two constituency 
parsers, with an f-score of .4. 
We utilized the Sign Test in order to deter-
mine the statistical significance of these differ-
ences. Rank orderings between pairs of systems 
were determined based on the adjusted credit that 
each system achieved for each test sentence. Sig-
nificant differences were found between the per-
formance of every system (p<0.05), with the ex-
ception of the Charniak and Stanford parsers. 
Interestingly, by comparing weighted values for 
each test example, Minipar C more frequently 
scores higher than Minipar A, even though the 
sum of these scores favors Minipar A. 
In addition to overall performance, we were 
interested in determining whether performance 
varied depending on the type of the argument 
that is being labeled. In assigning labels to argu-
ments in the corpus, we folowed the general 
principles set out by Palmer et al. (205) for la-
beling arguments arg0, arg1 and arg2. Acros 
each of our four predicates, arg0 is the agent of 
the predication (e.g. the person that has the belief 
or is doing the giving), and arg1 is the thing that 
is acted upon by the agent (e.g. the thing that is 
believed or the thing that is given). Arg2 is used 
only for the predications based on the verbs give 
and receive, where it is used to indicate the other 
party of the action. 
Our interest was in determining whether these 
five approaches yielded different results depend-
ing on the semantic type of the argument. Fig-
ure 4 presents the f-scores for each of these en-
codings acros each argument type. 
Results indicate that the Charniak and Stan-
ford parsers continue to produce parse tree paths 
that outperform each of the Minipar-based ap-
proaches. In each approach argument 0 is the 
easiest to identify. Minipar A retains the general 
trends of Charniak and Stanford, with argument 
 
Figure 3. Precision, recall, f-scores, and adjusted recall for five parse tree path types 
 
Figure 4. Comparative f-scores for arguments 0, 1, and 2 for five parse tree path types 
815
1 easier to identify than argument 2, while Mini-
par B and C show the reverse. The highest f-
scores for argument 0 were achieved Stanford 
(f=.65), while Charniak achieved the highest 
scores for argument 1 (f=.5) and argument 2 
(f=.49). 
5 Learning Curve Comparisons 
The creation of large-scale text corpora with syn-
tactic and/or semantic annotations is difficult, 
expensive, and time consuming. The PropBank 
effort has shown that producing this type of cor-
pora is considerably easier once syntactic analy-
sis has been done, but substantial effort and re-
sources are stil required. Better estimates of total 
costs could be made if it was known exactly how 
many annotations are necessary to achieve ac-
ceptable levels of performance. Accordingly, we 
investigated the learning curves of precision, re-
call, f-score, and adjusted recall achieved using 
the five different parse tree path encodings. 
For each encoding approach, learning curves 
were created by applying successively larger 
subsets of the training parse tree paths to each of 
the items in the corresponding test set. Precision, 
recall, f-scores, and adjusted recall were com-
puted as described in the previous section, and 
identical subsets of sentences were used acros 
parsers, in one-sentence increments. Individual 
learning curves for each of the five approaches 
are given in Figures 5, 6, 7, 8, and 9. Figure 10 
presents a comparison of the f-score learning 
curves for all five of the approaches. 
In each approach, the precision scores slowly 
degrade as more training examples are provided, 
due to the addition of new parse tree paths that 
yield additional candidate answers. Conversely, 
the recall scores of each system show their great-
est gains early, and then slowly improve with the 
addition of more parse tree paths. In each ap-
proach, the recall scores (estimating best-case 
performance) have the same general shape as the 
adjusted recall scores (estimating the lower-
bound performance). The divergence between 
these two scores increases with the addition of 
more training examples, and is more pronounced 
in systems employing parse tree paths with less 
specific node information. The comparative f-
score curves presented in Figure 10 indicate that 
Minipar B is competitive with Charniak and 
Stanford when only a small number of training 
examples is available. There is some evidence 
here that the performance of Minipar A would 
continue to improve with the addition of more 
training data, sugesting that this approach might 
be well-suited for applications where lots of 
training data is available. 
6 Discusion 
Anotated corpora of linguistic phenomena en-
able many new natural language processing ap-
plications and provide new means for tackling 
difficult research problems. Just as the Penn 
Treebank offers the posibility of developing 
systems capable of accurate syntactic parsing, 
corpora of semantic role annotations open up 
new posibilities for rich textual understanding 
and integrated inference. 
In this paper, we compared five encodings of 
parse tree paths based on two constituency pars-
ers and a dependency parser. Despite our expec-
tations that the semantic richness of dependency 
parses would yield paths that outperformed the 
others, we discovered that parse tree paths from 
Charniak’s constituency parser performed the 
best overall. In applications where either preci-
sion or recall is the only concern, then Minipar-
derived parse tree paths would yield the best re-
sults. We also found that the performance of all 
of these systems varied acros different argument 
types.  
 In contrast to the performance results reported 
by Palmer et al. (205) and Gildea & Jurafsky 
(202), our evaluation was based solely on parse 
tree path features. Even so, we were able to ob-
tain reasonable levels of performance without the 
use of additional features or stochastic methods. 
Learning curves indicate that the greatest gains 
in performance can be garnered from the first 10 
or so training examples. This result has implica-
tions for the development of large-scale corpora 
of semantically annotated text. Developers 
should distribute their effort in order to maxi-
mize the number of predicate-argument pairs 
with at least 10 annotations. 
An automated semantic role labeling system 
could be constructed using only the parse tree 
path features described in this paper, with esti-
mated performance between our recall scores and 
our adjusted recall scores. There are several ways 
to improve on the random selection approach 
used in the adjusted recall calculation. For exam-
ple, one could simply select the candidate answer 
with the most frequent parse tree path. 
The results presented in this paper help inform 
the design of future automated semantic role la-
beling systems that improve on the best-
performing systems available today (Gildea &  
816
 
 
 
 
 
Figure 5. Charniak learning curves 
 
 
 
 
Figure 6. Stanford learning curves 
 
 
 
 
Figure 7. Minipar A learning curves 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 8. Minipar B learning curves 
 
 
 
 
Figure 9. Minipar C learning curves 
 
 
 
 
Figure 10. Comparative F-score curves 
 
 
817
Jurafsky, 202; Moschiti et al., 205). We found 
that different parse tree paths encode different 
types of linguistic information, and exhibit dif-
ferent characteristics in the tradeoff between pre-
cision and recall. The best approaches in future 
systems wil intelligently capitalize on these dif-
ferences in the face of varying amounts of train-
ing data. 
In our own future work, we are particularly in-
terested in exploring the regularities that exist 
among parse tree paths for different predicates. 
By identifying these regularities, we believe that 
we wil be able to significantly reduce the total 
number of annotations necessary to develop lexi-
cal resources that have broad coverage over natu-
ral language. 
Acknowledgments 
The project or effort depicted was sponsored by 
the U. S. Army Research, Development, and En-
gineering Command (RDECOM). The content or 
information does not necessarily reflect the posi-
tion or the policy of the Government, and no of-
ficial endorsement should be inferred. 
References 
Baker, Colin, Charles J. Filmore and John B. Lowe. 
198. The Berkeley FrameNet Project, In Proced-
ings of COLING-ACL, Montreal. 
Charniak, Eugene. 200. A maximum-entropy-
inspired parser, In Procedings NACL. 
Colins, Michael. 199. Head-Driven Statistical Mod-
els for Natural Language Parsing, PhD thesis, Uni-
versity of Pensylvania. 
Gildea, Daniel and Daniel Jurafsky. 202. Automatic 
labeling of semantic roles, Computational Linguis-
tics, 28(3):245-28. 
Klein, Dan and Christopher Maning. 203. Acurate 
Unlexicalized Parsing, In Procedings of the 41st 
Anual Meting of the Asociation for Computa-
tional Linguistics, 423-430. 
Levin, Beth. 193. English Verb Clases and Alterna-
tions: A Preliminary Investigation. Chicago, IL: 
University of Chicago Pres. 
Lin, Dekang. 198. Dependency-Based Evaluation of 
MINIPAR, In Procedings of the Workshop on the 
Evaluation of Parsing Systems, First International 
Conference on Language Resources and Evaluation, 
Granada, Spain. 
Marcus, Mitchel P., Beatrice Santorini and Mary An 
arcinkiewicz. 193. Building a Large Anotated 
Corpus of English: The Pen Trebank, Computa-
tional Linguistics, 19(35):313-30 
Moldovan, Dan I., Christine Clark, Sanda M. Hara-
bagiu & Steven J. Maiorano. 203. COGEX: A 
Logic Prover for Question Answering, HLT-
NACL. 
Moschiti, A., Giuglea, A., Copola, B. & Basili, R. 
205. Hierarchical Semantic Role Labeling. In 
Procedings of the 9th Conference on Computa-
tional Natural Language Learning (CoNL 205 
shared task), An Arbor(MI), USA. 
Palmer, Martha, Dan Gildea and Paul Kingsbury. 
205. The Proposition Bank: An Anotated Corpus 
of Semantic Roles, Computational Linguistics. 
Pantel, Patrick and Dekang Lin. 202. Document 
clustering with comites, In Procedings of 
SIGIRO2, Tampere, Finland. 
 
818
