Assigning Function Tags to Parsed Text* 
Don Blaheta and Eugene Charniak 
{dpb, ec}@cs, brown, edu 
Department of Computer Science 
Box 1910 / 115 Waterman St.--4th floor 
Brown University 
Providence, RI 02912 
Abstract 
It is generally recognized that the common non- 
terminal labels for syntactic constituents (NP, 
VP, etc.) do not exhaust the syntactic and se- 
mantic information one would like about parts 
of a syntactic tree. For example, the Penn Tree- 
bank gives each constituent zero or more 'func- 
tion tags' indicating semantic roles and other 
related information not easily encapsulated in 
the simple constituent labels. We present a sta- 
tistical algorithm for assigning these function 
tags that, on text already parsed to a simple- 
label level, achieves an F-measure of 87%, which 
rises to 99% when considering 'no tag' as a valid 
choice. 
1 Introduction 
Parsing sentences using statistical information 
gathered from a treebank was first examined a 
decade ago in (Chitrao and Grishman, 1990) 
and is by now a fairly well-studied problem 
((Charniak, 1997), (Collins, 1997), (Ratna- 
parkhi, 1997)). But to date, the end product of 
the parsing process has for the most part been 
a bracketing with simple constituent labels like 
NP, VP, or SBAR. The Penn treebank contains a 
great deal of additional syntactic and seman- 
tic information from which to gather statistics; 
reproducing more of this information automat- 
ically is a goal which has so far been mostly 
ignored. This paper details a process by which 
some of this information--the function tags-- 
may be recovered automatically. 
In the Penn treebank, there are 20 tags (fig- 
ure 1) that can be appended to constituent la- 
bels in order to indicate additional information 
about the syntactic or semantic role of the con- 
* This research was funded in part by NSF grants LIS- 
SBR-9720368 and IGERT-9870676. 
stituent. We have divided them into four cate- 
gories (given in figure 2) based on those in the 
bracketing guidelines (Bies et al., 1995). A con- 
stituent can be tagged with multiple tags, but 
never with two tags from the same category. 1 
In actuality, the case where a constituent has 
tags from all four categories never happens, but 
constituents with three tags do occur (rarely). 
At a high level, we can simply say that hav- 
ing the function tag information for a given text 
is useful just because any further information 
would help. But specifically, there are distinct 
advantages for each of the various categories. 
Grammatical tags are useful for any application 
trying to follow the thread of the text--they find 
the 'who does what' of each clause, which can 
be useful to gain information about the situa- 
tion or to learn more about the behaviour of 
the words in the sentence. The form/function 
tags help to find those constituents behaving in 
ways not conforming to their labelled type, as 
well as further clarifying the behaviour of ad- 
verbial phrases. Information retrieval applica- 
tions specialising in describing events, as with a 
number of the MUC applications, could greatly 
benefit from some of these in determining the 
where-when-why of things. Noting a topicalised 
constituent could also prove useful to these ap- 
plications, and it might also help in discourse 
analysis, or pronoun resolution. Finally, the 
'miscellaneous' tags are convenient at various 
times; particularly the CLI~ 'closely related' tag, 
which among other things marks phrasal verbs 
and prepositional ditransitives. 
To our knowledge, there has been no attempt 
so far to recover the function tags in pars- 
ing treebank text. In fact, we know of only 
1There is a single exception in the corpus: one con- 
stituent is tagged with -LOC-I~R. This appears to be an 
error. 
234 
ADV Non-specific adverbial 
BNF Benefemtive 
CLF It-cleft 
CLR 'Closely related' 
DIR Direction 
DTV Dative 
EXT Extent 
HLN Headline 
LGS Logical subject 
L0C Location 
MNI~ Manner 
N0M Nominal 
PRD Predicate 
PRP Purpose 
PUT Locative complement of 'put' 
SBJ Subject 
TMP Temporal 
TPC Topic 
TTL Title 
V0C Vocative 
Grammatical 
DTV 0.48% 
LGS 3.0% 
PRD 18.% 
PUT 0.26% 
SBJ 78.% 
v0c 0.025% 
Figure 1: Penn treebank function tags 
53.% Form/Function 37.% Topicalisation 2.2% 
0.25% NOM 6.8% 2.5% TPC 100% 2.2% 
1.5% ADV 11.% 4.2% 
9.3% BN'F 0.072% 0.026% 
0.13% DIR 8.3% 3.0% 
41.% EXT 3.2% 1.2% 
0.013% LOC 25.% 9.2% 
MNR 6.2% 2.3% 
PI~ 5.2% 1.9% 
33.% 12.% 
Miscellaneous 9.5% 
CLR 94.% 8.8% 
CLF 0.34% 0.03% 
HLN 2.6% 0.25% 
TTL 3.1% 0.29% 
Figure 2: Categories of function tags and their relative frequencies 
one project that used them at all: (Collins, 
1997) defines certain constituents as comple- 
ments based on a combination of label and func- 
tion tag information. This boolean condition is 
then used to train an improved parser. 
2 Features 
We have found it useful to define our statisti- 
cal model in terms of features. A 'feature', in 
this context, is a boolean-valued function, gen- 
erally over parse tree nodes and either node la- 
bels or lexical items. Features can be fairly sim- 
ple and easily read off the tree (e.g. 'this node's 
label is X', 'this node's parent's label is Y'), or 
slightly more complex ('this node's head's part- 
of-speech is Z'). This is concordant with the us- 
age in the maximum entropy literature (Berger 
et al., 1996). 
When using a number of known features to 
guess an unknown one, the usual procedure is 
to calculate the value of each feature, and then 
essentially look up the empirically most proba- 
ble value for the feature to be guessed based on 
those known values. Due to sparse data, some 
of the features later in the list may need to be 
ignored; thus the probability of an unknown fea- 
ture value would be estimated as 
P(flYl, • •, Y,) 
P(flfl, f2,...,fj), j < n, (1) 
where/3 refers to an empirically observed prob- 
ability. Of course, if features 1 through i only 
co-occur a few times in the training, this value 
may not be reliable, so the empirical probability 
is usually smoothed: 
P(flfl, Ii) 
AiP(flfl, fa,..., fi) 
+ (2) 
The values for )~i can then be determined ac- 
cording to the number of occurrences of features 
1 through i together in the training. 
One way to think about equation 1 (and 
specifically, the notion that j will depend on 
the values of fl... fn) is as follows: We begin 
with the prior probability of f. If we have data 
indicating P(flfl), we multiply in that likeli- 
hood, while dividing out the original prior. If 
we have data for/3(flfl, f2), we multiply that 
in while dividing out the P(flfl) term. This is 
repeated for each piece of feature data we have; 
at each point, we are adjusting the probability 
235 
P(flfl,f2,... ,fn) p(/) P(SlA) P(SlSl, S:) P(f) P(flfl) 
P(flfl,..., Yi-1, A) 
-,_-o " p- ff, 
P(flft, $2,..., f~) 
P(flA, A,...,f¢-x) j<n 
(3) 
we already have estimated. If knowledge about 
feature fi makes S more likely than with just 
fl... fi-1, the term where fi is added will be 
greater than one and the running probability 
will be adjusted upward. This gives us the new 
probability shown in equation 3, which is ex- 
actly equivalent to equation 1 since everything 
except the last numerator cancels out of the 
equation. The value of j is chosen such that 
features fl..-fj are sufficiently represented in 
the training data; sometimes all n features are 
used, but often that would cause sparse data 
problems. Smoothing is performed on this equa- 
tion exactly as before: each term is interpolated 
between the empirical value and the prior esti- 
mated probability, according to a value of Ai 
that estimates confidence. But aside from per- 
haps providing a new way to think about the 
problem, equation 3 is not particularly useful 
as it is--it is exactly the same as what we had 
before. Its real usefulness comes, as shown in 
(Charniak, 1999), when we move from the no- 
tion of a feature chain to a feature tree. 
These feature chains don't capture everything 
we'd like them to. If there are two independent 
features that are each relatively sparse but occa- 
sionally carry a lot of information, then putting 
one before the other in a chain will effectively 
block the second from having any effect, since 
its information is (uselessly) conditioned on the 
first one, whose sparseness will completely di- 
lute any gain. What we'd really like is to be able 
to have a feature tree, whereby we can condition 
those two sparse features independently on one 
common predecessor feature. As we said be- 
fore, equation 3 represents, for each feature fi, 
the probability of f based on fi and all its pre- 
decessors, divided by the probability of f based 
only on the predecessors. In the chain case, this 
means that the denominator is conditioned on 
every feature from 1 to i - 1, but if we use a 
feature tree, it is conditioned only on those fea- 
tures along the path to the root of the tree. 
A notable issue with feature trees as opposed 
to feature chains is that the terms do not all 
cancel out. Every leaf on the tree will be repre- 
target ~ 
feature 
Figure 3: A small example feature tree 
sented in the numerator, and every fork in the 
tree (from which multiple nodes depend) will 
be represented at least once in the denomina- 
tor. For example: in figure 3 we have a small 
feature tree that has one target feature and four 
conditioning features. Features b and d are in- 
dependent of each other, but each depends on a; 
c depends directly only on b. The unsmoothed 
version of the corresponding equation would be 
P(fla, b, c, d) ,~ 
p,~ P(fla) ~)(f\]a, b) P(f\[a, b, c) P(fla, d) 
which, after cancelling of terms and smoothing, 
results in 
P(fla, b, c, d) P(fla, b, c)P(fla, d) P(fla) (4) 
Note that strictly speaking the result is not a 
probability distribution. It could be made into 
one with an appropriate normalisation--the 
so-called partition function in the maximum- 
entropy literature. However, if the indepen- 
dence assumptions made in the derivation of 
equation 4 are good ones, the partition func- 
tion will be close to 1.0. We assume this to be 
the case for our feature trees. 
Now we return the discussion to function tag- 
ging. There are a number of features that seem 
236 
function 
tag label 
succeeding preceding 
, ,./-d~el laf)el 
pare_p~ 
gra-'n~arent's parent's 
label head's POS 
grandparent's 
h~POS 
headS~ parent's 
P~ead 
head 
alt-head's 
POs alt-~ead 
Figure 4: The feature tree used to guess function tags 
to condition strongly for one function tag or an- 
other; we have assembled them into the feature 
tree shown in figure 4. 2 This figure should be 
relatively self-explanatory, except for the notion 
of an 'alternate head'; currently, an alternate 
head is only defined for prepositional phrases, 
and is the head of the object of the preposi- 
tional phrase. This data is very important in 
distinguishing, for example, 'by John' (where 
John might be a logical subject) from 'by next 
year' (a temporal modifier) and 'by selling it' 
(an adverbial indicating manner). 
3 Experiment 
In the training phase of our experiment, we 
gathered statistics on the occurrence of func- 
tion tags in sections 2-21 of the Penn treebank. 
Specifically, for every constituent in the tree- 
bank, we recorded the presence of its function 
tags (or lack thereof) along with its condition- 
ing information. From this we calculated the 
empirical probabilities of each function tag ref- 
erenced in section 2 of this paper. Values of )~ 
were determined using EM on the development 
corpus (treebank section 24). 
To test, then, we simply took the output of 
our parser on the test corpus (treebank section 
23), and applied a postprocessing step to add 
function tags. For each constituent in the tree, 
we calculated the likelihood of each function tag 
according to the feature tree in figure 4, and 
for each category (see figure 2) we assigned the 
most likely function tag (which might be the 
null tag). 
2The reader will note that the 'features' listed in the 
tree are in fact not boolean-valued; each node in the 
given tree can be assumed to stand for a chain of boolean 
features, one per potential value at that node, exactly 
one of which will be true. 
4 Evaluation 
To evaluate our results, we first need to deter- 
mine what is 'correct'. The definition we chose 
is to call a constituent correct if there exists in 
the correct parse a constituent with the same 
start and end points, label, and function tag 
(or lack thereof). Since we treated each of the 
four function tag categories as a separate fea- 
ture for the purpose of tagging, evaluation was 
also done on a per-category basis. 
The denominator of the accuracy measure 
should be the maximum possible number we 
could get correct. In this case, that means 
excluding those constituents that were already 
wrong in the parser output; the parser we used 
attains 89% labelled precision-recall, so roughly 
11% of the constituents are excluded from the 
function tag accuracy evaluation. (For refer- 
ence, we have also included the performance of 
our function tagger directly on treebank parses; 
the slight gain that resulted is discussed below.) 
Another consideration is whether to count 
non-tagged constituents in our evaluation. On 
the one hand, we could count as correct any 
constituent with the correct tag as well as any 
correctly non-tagged constituent, and use as 
our denominator the number of all correctly- 
labelled constituents. (We will henceforth refer 
to this as the 'with-null' measure.) On the other 
hand, we could just count constituents with the 
correct tag, and use as our denominators the 
total number of tagged, correctly-labelled con- 
stituents. We believe the latter number ('no- 
null') to be a better performance metric, as it 
is not overwhelmed by the large number of un- 
tagged constituents. Both are reported below. 
237 
Category 
Grammatical 
Form/Function 
Topicalisation 
Miscellaneous 
Overall 
Table 1: Baseline performance 
Baseline 1 
(never tag) Tag Precision 
86.935% SBJ 10.534% 
91.786% THP 3.105% 
99.406% TPC 0.594% 
98.436% CLR 1.317% 
94.141% -- 3.887% 
Baseline 2 (always choose most likely tag) 
Recall F-measure 
80.626% 18.633% 
37.795% 5.738% 
100.00% 1.181% 
84.211% 2.594% 
66.345% 7.344% 
Table 2: Performance within each category 
With-null ---No-null-- 
Category Accuracy Precision Recall F-measure 
Grammatical 98.909% 95.472% 95.837% 95.654% 
Form/Function 97.104% 80.415% 77.595% 78.980% 
Topicalisation 99.915% 92.195% 93.564% 92.875% 
Miscellaneous 98.645% 55.644% 65.789% 60.293% 
5 Results 
5.1 Baselines 
There are, it seems, two reasonable baselines 
for this and future work. First of all, most con- 
stituents in the corpus have no tags at all, so 
obviously one baseline is to simply guess no tag 
for any constituent. Even for the most com- 
mon type of function tag (grammatical), this 
method performs with 87% accuracy. Thus the 
with-null accuracy of a function tagger needs to 
be very high to be significant here. 
The second baseline might be useful in ex- 
amining the no-null accuracy values (particu- 
larly the recall): always guess the most common 
tag in a category. This means that every con- 
stituent gets labelled with '-SBJ-THP-TPC-CLR' 
(meaning that it is a topicalised temporal sub- 
ject that is 'closely related' to its verb). This 
combination of tags is in fact entirely illegal 
by the treebank guidelines, but performs ad- 
equately for a baseline. The precision is, of 
course, abysmal, for the same reasons the first 
baseline did so well; but the recall is (as one 
might expect) substantial. The performances 
of the two baseline measures are given in Table 
1. 
5.2 Performance in individual 
categories 
In table 2, we give the results for each category. 
The first column is the with-null accuracy, and 
the precision and recall values given are the no- 
null accuracy, as noted in section 4. 
Grammatical tagging performs the best of the 
four categories. Even using the more difficult 
no-null accuracy measure, it has a 96% accu- 
racy. This seems to reflect the fact that gram- 
matical relations can often be guessed based on 
constituent labels, parts of speech, and high- 
frequency lexical items, largely avoiding sparse- 
data problems. Topicalisation can similarly be 
guessed largely on high-frequency information, 
and performed almost as well (93%). 
On the other hand, we have the 
form/function tags and the 'miscellaneous' 
tags. These are characterised by much more 
semantic information, and the relationships 
between lexical items are very important, 
making sparse data a real problem. All the 
same, it should be noted that the performance 
is still far better than the baselines. 
5.3 Performance with other feature 
trees 
The feature tree given in figure 4 is by no means 
the only feature tree we could have used. In- 
238 
Table 3: Overall performance on different inputs 
With-null --No-null- 
Category Accuracy Precision Recall F-measure 
Parsed 98.643% 87.173% 87.381% 87.277% 
Treebank 98.805% 88.450% 88.493% 88.472% 
deed, we tried a number of different trees on the 
development corpus; this tree gave among the 
best overall results, with no category perform- 
ing too badly. However, there is no reason to 
use only one feature tree for all four categories; 
the best results can be got by using a separate 
tree for each one. One can thus achieve slight 
(one to three point) gains in each category. 
5.4 Overall performance 
The overall performance, given in table 3, ap- 
pears promising. With a tagging accuracy of 
about 87%, various information retrieval and 
knowledge base applications can reasonably ex- 
pect to extract useful information. 
The performance given in the first row is (like 
all previously given performance values) the 
function-tagger's performance on the correctly- 
labelled constituents output by our parser. For 
comparison, we also give its performance when 
run directly on the original treebank parse; since 
the parser's accuracy is about 89%, working di- 
rectly with the treebank means our statistics 
are over roughly 12% more constituents. This 
second version does slightly better. 
The main reason that tagging does worse on 
the parsed version is that although the con- 
stituent itself may be correctly bracketed and la- 
belled, its exterior conditioning information can 
still be incorrect. An example of this that ac- 
tually occurred in the development corpus (sec- 
tion 24 of the treebank) is the 'that' clause in 
the phrase 'can swallow the premise that the re- 
wards for such ineptitude are six-figure salaries', 
correctly diagrammed in figure 5. The function 
tagger gave this SBAR an ADV tag, indicating an 
unspecified adverbial function. This seems ex- 
tremely odd, given that its conditioning infor- 
mation (nodes circled in the figure) clearly show 
that it is part of an NP, and hence probably mod- 
ifies the preceding NN. Indeed, the statistics give 
the probability of an ADV tag in this condition- 
ing environment as vanishingly small. 
vP 
the ( premise )~ 
Figure 5: SBAR and conditioning info 
the premise ~ ... 
Figure 6: SBAR and conditioning info, as parsed 
However, this was not the conditioning infor- 
mation that the tagger received. The parser 
had instead decided on the (incorrect) parse in 
figure 6. As such, the tagger's decision makes 
much more sense, since an SBAR under two VPs 
whose heads are VB and MD is rather likely to be 
an ADV. (For instance, the 'although' clause of 
the sentence 'he can help, although he doesn't 
want to.' has exactly the conditioning environ- 
ment given in figure 6, except that its prede- 
cessor is a comma; and this SBAR would be cor- 
rectly tagged ADV.) The SBAR itself is correctly 
bracketed and labelled, so it still gets counted 
in the statistics. Happily, this sort of case seems 
to be relatively rare. 
239 
Another thing that lowers the overall perfor- 
mance somewhat is the existence of error and in- 
consistency in the treebank tagging. Some tags 
seem to have been relatively easy for the human 
treebank taggers, and have few errors. Other 
tags have explicit caveats that, however well- 
justified, proved difficult to remember for the 
taggers--for instance, there are 37 instances of 
a PP being tagged with LGS (logical subject) in 
spite of the guidelines specifically saying, '\[LGS\] 
attaches to the NP object of by and not to the 
PP node itself.' (Bies et al., 1995) Each mistag- 
ging in the test corpus can cause up to two spu- 
rious errors, one in precision and one in recall. 
Still another source of difficulty comes when the 
guidelines are vague or silent on a specific issue. 
To return to logical subjects, it is clear that 'the 
loss' is a logical subject in 'The company was 
hurt by the loss', but what about in 'The com- 
pany was unperturbed by the loss' ? In addition, 
a number of the function tags are authorised for 
'metaphorical use', but what exactly constitutes 
such a use is somewhat inconsistently marked. 
It is as yet unclear just to what degree these 
tagging errors in the corpus are affecting our 
results. 
6 Conclusion 
This work presents a method for assigning func- 
tion tags to text that has been parsed to the 
simple label level. Because of the lack of prior 
research on this task, we are unable to com- 
pare our results to those of other researchers; 
but the results do seem promising. However, a 
great deal of future work immediately suggests 
itself: 
• Although we tested twenty or so feature 
trees besides the one given in figure 4, the 
space of possible trees is still rather un- 
explored. A more systematic investiga- 
tion into the advantages of different feature 
trees would be useful. 
• We could add to the feature tree the val- 
ues of other categories of function tag, or 
the function tags of various tree-relatives 
(parent, sibling). 
• One of the weaknesses of the lexical fea- 
tures is sparse data; whereas the part of 
speech is too coarse to distinguish 'by John' 
(LGS) from 'by Monday' (TMP), the lexi- 
cal information may be too sparse. This 
could be assisted by clustering the lexical 
items into useful categories (names, dates, 
etc.), and adding those categories as an ad- 
ditional feature type. 
• There is no reason to think that this work 
could not be integrated directly into the 
parsing process, particularly if one's parser 
is already geared partially or entirely to- 
wards feature-based statistics; the func- 
tion tag information could prove quite use- 
ful within the parse itself, to rank several 
parses to find the most plausible. 

References 

Adam L. Berger, Stephen A. Della Pietra, 
and Vincent J. Della Pietra. 1996. A 
maximum entropy approach to natural  processing. Computational Linguistics, 
22(1):39-71. 

Ann Bies, Mark Ferguson, Karen Katz, and 
Robert MacIntyre, 1995. Bracketing Guidelines for Treebank H Style Penn Treebank Project, January. 

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word 
statistics. In Proceedings of the Fourteenth 
National Conference on Artificial Intelligence, pages 598-603, Menlo Park. AAAI 
Press/MIT Press. 

Eugene Charniak. 1999. A maximum-entropy-inspired parser. Technical Report CS-99-12, 
Brown University, August. 

Mahesh V. Chitrao and Ralph Grishman. 1990. 
Statistical parsing of messages. In DARPA 
Speech and Language Workshop, pages 263-266. 

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the 
Association for Computational Linguistics, 
pages 16-23. 

Adwait Ratnaparkhi. 1997. A linear observed 
time statistical parser based on maximum entropy models. In Proceedings of the Second 
Annual Conference on Empirical Methods in 
Natural Language Processing, pages 1-10. 
