In: Proceedings of CoNLL-2000 and LLL-2000, pages 145-147, Lisbon, Portugal, 2000. 
Shallow Parsing as Part-of-Speech Tagging* 
Miles Osborne 
University of Edinburgh 
Division of Informatics 
2 Buccleuch Place 
Edinburgh EH8 9LW , Scotland 
osborne@cogsci.ed.ac.uk
Abstract 
Treating shallow parsing as part-of-speech tag- 
ging yields results comparable with other, more 
elaborate approaches. Using the CoNLL 2000 
training and testing material, our best model 
had an accuracy of 94.88%, with an overall FB1 
score of 91.94%. The individual FB1 scores for 
NPs were 92.19%, VPs 92.70% and PPs 96.69%. 
1 Introduction 
Shallow parsing has received a reasonable 
amount of attention in the last few years (for 
example (Ramshaw and Marcus, 1995)). In 
this paper, instead of modifying some existing 
technique, or else proposing an entirely new ap- 
proach, we decided to build a shallow parser us- 
ing an off-the-shelf part-of-speech (POS) tagger. 
We deliberately did not modify the POS tag- 
ger's internal operation in any way. Our results 
suggested that achieving reasonable shallow- 
parsing performance does not in general require 
anything more elaborate than a simple POS tag- 
ger. However, an error analysis suggested the 
existence of a small set of constructs that are 
not so easily characterised by finite-state ap- 
proaches such as ours. 
2 The Tagger 
We used Ratnaparkhi's maximum entropy- 
based POS tagger (Ratnaparkhi, 1996). When 
tagging, the model tries to recover the most 
likely (unobserved) tag sequence, given a se- 
quence of observed words. 
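The decoding objective just described can be written compactly as follows. This is only a sketch of the general family of conditional exponential taggers; the notation ($h_i$ for the context visible at position $i$, $f_j$ for feature functions and $\lambda_j$ for their weights) is ours and is not taken from the tagger's distribution.

```latex
\hat{t}_1 \ldots \hat{t}_n
  = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)
  \approx \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} p(t_i \mid h_i),
\qquad
p(t \mid h) = \frac{\exp\bigl(\sum_j \lambda_j f_j(h, t)\bigr)}
                   {\sum_{t'} \exp\bigl(\sum_j \lambda_j f_j(h, t')\bigr)}
```

Since the exponential is strictly positive, a model of this form gives every tag some probability mass in every context.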
For our experiments, we used the binary-only 
distribution of the tagger (Ratnaparkhi, 1996). 
* The full version of this paper can be found at
http://www.cogsci.ed.ac.uk/~osborne/shallow.ps
3 Convincing the Tagger to Shallow 
Parse 
The insight here is that one can view (some 
of) the differences between tagging and (shal- 
low) parsing as one of context: shallow pars- 
ing requires access to a greater part of the 
surrounding lexical/POS syntactic environment 
than does simple POS tagging. This extra in- 
formation can be encoded in a state. 
However, one must balance this approach 
with the fact that as the amount of information 
in a state increases, with limited training ma- 
terial, the chance of seeing such a state again 
in the future diminishes. We therefore would 
expect performance to increase as we increased 
the amount of information in a state, and then 
decrease when overfitting and/or sparse statistics
become dominant factors.
We trained the tagger using 'words' that were 
various 'configurations' (concatenations) of ac- 
tual words, POS tags, chunk-types, and/or suf- 
fixes or prefixes of words and/or chunk-types. 
By training upon these concatenations, we help 
bridge the gap between simple POS tagging and 
shallow parsing. 
In the rest of the paper, we refer to what the 
tagger considers to be a word as a configura- 
tion. A configuration will be a concatenation of 
various elements of the training set relevant to 
decision making regarding chunk assignment. A 
'word' will mean a word as found in the train- 
ing set. 'Tags' refer to the POS tags found in 
the training set. Again, such tags may be part 
of a configuration. We refer to what the tagger 
considers as a tag as a prediction. Predictions 
will be chunk labels. 
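As an illustration of this terminology, the sketch below turns training material in the three-column "word POS chunk" layout into lines of configuration/prediction pairs that a tagger could be trained on. The separators `SEP` and `JOIN`, the function names and the one-sentence-per-line output layout are assumptions made for the example; the paper does not state how configurations were written out.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str, str]  # (word, POS tag, chunk label)

SEP = "|"   # assumed separator between the elements of a configuration
JOIN = "_"  # assumed separator between a configuration and its prediction


def read_sentences(lines: Iterable[str]) -> Iterator[List[Row]]:
    """Group 'word POS chunk' lines into sentences (blank line = boundary)."""
    sentence: List[Row] = []
    for line in lines:
        fields = line.split()
        if not fields:
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, chunk = fields
        sentence.append((word, pos, chunk))
    if sentence:
        yield sentence


def to_tagger_line(sentence: List[Row]) -> str:
    """Emit one training line: each token is <configuration><JOIN><prediction>.

    Here the configuration is simply word and POS tag concatenated, and the
    prediction is the chunk label.
    """
    return " ".join(
        f"{word}{SEP}{pos}{JOIN}{chunk}" for word, pos, chunk in sentence
    )
```

A real conversion would also need to escape any separator characters that occur inside words; that is left out here for brevity.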
4 Experiments 
We now give details of the experiments we ran. 
To make matters clearer, consider the following 
fragment of the training set: 
Word      w1  w2  w3
POS Tag   t1  t2  t3
Chunk     c1  c2  c3
Words are w1, w2 and w3, tags are t1, t2 and t3
and chunk labels are c1, c2 and c3. Throughout,
we built various configurations when predicting¹
the chunk label for word w1.
With respect to the situation just mentioned 
(predicting the label for word w1), we gradually
increased the amount of information in each
configuration as follows: 
1. A configuration consisting of just words 
(word w1). Results:
Chunk type       P        R      FB1
Overall      88.06    88.71    88.38
ADJP         67.57    51.37    58.37
ADVP         74.34    74.25    74.29
CONJP        54.55    66.67    60.00
INTJ        100.00    50.00    66.67
LST           0.00     0.00     0.00
NP           87.84    89.41    88.62
PP           94.80    95.91    95.35
PRT          71.00    66.98    68.93
SBAR         82.30    72.15    76.89
VP           86.68    88.15    87.41
Overall accuracy: 92.76% 
2. A configuration consisting of just tags (tag 
t1). Results:
Chunk type       P        R      FB1
Overall      88.15    88.07    88.11
ADJP         67.99    54.79    60.68
ADVP         71.61    70.79    71.20
CONJP        35.71    55.56    43.48
INTJ          0.00     0.00     0.00
LST           0.00     0.00     0.00
NP           89.47    89.57    89.52
PP           87.70    95.28    91.33
PRT          52.27    21.70    30.67
SBAR         83.92    31.21    45.50
VP           90.38    91.18    90.78
Overall accuracy: 92.66%
¹ For space reasons, we had to remove many of these
experiments. The longer version of the paper gives
relevant details.

3. Words, tags and the current chunk label
(w1, t1, c1) in a configuration. We allowed
the tagger access to the current chunk label
by training another model with configurations
consisting of tags and words (w1 and t1). The
training set was then reduced to consist of
just tag-word configurations and tagged using
this model. Afterwards, we collected the
predictions for use in the second model.
Results:
Chunk type       P        R      FB1
Overall      89.79    90.70    90.24
ADJP         69.61    57.53    63.00
ADVP         74.72    77.14    75.91
CONJP        54.55    66.67    60.00
INTJ         50.00    50.00    50.00
LST           0.00     0.00     0.00
NP           89.80    91.12    90.4
PP           95.15    96.26    95.70
PRT          71.84    69.81    70.81
SBAR         85.63    80.19    82.82
VP           89.54    91.31    90.41
Overall accuracy: 93.79% 
4. The final configuration made an attempt
to deal with sparse statistics. It consisted
of the current tag t1, the next tag t2, the
current chunk label c1, the last two letters
of the next chunk label c2, the first two
letters of the current word w1 and the last
four letters of the current word w1. This
configuration was the result of numerous
experiments and gave the best overall
performance. The results can be found in
Table 1.
We remark upon our experiments in the com- 
ments section. 
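The four configurations listed above can be sketched as small string-building functions. The element order follows the descriptions in the list, but the separator `SEP`, the padding symbol `PAD` used past the end of a sentence, and all function names are assumptions of this sketch. The `chunks` argument stands for the chunk labels predicted by the first-stage model trained on word and tag configurations.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (word, POS tag)

SEP = "|"     # assumed separator between elements of a configuration
PAD = "NONE"  # assumed placeholder when there is no next token


def config_words(rows: List[Row], chunks: List[str], i: int) -> str:
    """Configuration 1: just the current word w1."""
    return rows[i][0]


def config_tags(rows: List[Row], chunks: List[str], i: int) -> str:
    """Configuration 2: just the current tag t1."""
    return rows[i][1]


def config_word_tag_chunk(rows: List[Row], chunks: List[str], i: int) -> str:
    """Configuration 3: current word, tag and first-stage chunk label."""
    word, tag = rows[i]
    return SEP.join([word, tag, chunks[i]])


def config_final(rows: List[Row], chunks: List[str], i: int) -> str:
    """Configuration 4: t1, t2, c1, last two letters of c2,
    first two letters of w1, last four letters of w1."""
    word, tag = rows[i]
    has_next = i + 1 < len(rows)
    next_tag = rows[i + 1][1] if has_next else PAD
    next_chunk = chunks[i + 1] if has_next else PAD
    return SEP.join(
        [tag, next_tag, chunks[i], next_chunk[-2:], word[:2], word[-4:]]
    )
```

Truncating the word to a prefix and a suffix, and the next chunk label to its last two letters, makes distinct contexts collapse onto the same configuration string, which is one way of reducing sparseness.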
5 Error Analysis 
We examined the performance of our final 
model with respect to the testing material and 
found that errors made by our shallow parser 
could be grouped into three categories: diffi- 
cult syntactic constructs, mistakes made in the 
training or testing material by the annotators, 
and errors peculiar to our approach.²
Taking each category of the three in turn, 
problematic constructs included: co-ordination, 
punctuation, treating ditransitive VPs as being 
transitive VPs, confusions regarding adjective 
or adverbial phrases, and copulas seen as being
possessives.
² The raw results can be found at:
http://www.cogsci.ed.ac.uk/~osborne/conll00-results.txt
The mis-analysed sentences can be found at:
http://www.cogsci.ed.ac.uk/~osborne/conll00-results.txt
Mistakes (noise) in the training and testing 
material were mainly POS tagging errors. An 
additional source of errors was odd annotation
decisions. 
The final source of errors was peculiar to our
system. Exponential distributions (as used by 
our tagger) assign a non-zero probability to all 
possible events. This means that the tagger will 
at times assign chunk labels that are illegal, for 
example assigning a word the label I-NP when 
the word is not in an NP. Although these errors
were infrequent, eliminating them would require 
'opening-up' the tagger and rejecting illegal hy- 
pothesised chunk labels from consideration. 
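Such illegal labels can at least be detected after tagging, without opening up the tagger. The sketch below assumes the labelling scheme in which every chunk of type X starts with B-X and continues with I-X, so an I-X label is illegal unless the previous label is B-X or I-X. The repair step, which rewrites an illegal I-X as B-X, is one simple heuristic chosen for illustration; it was not part of the experiments reported here.

```python
from typing import List


def illegal_positions(labels: List[str]) -> List[int]:
    """Return indices where an I-X label does not continue a chunk of type X."""
    bad = []
    prev = "O"  # treat the sentence start as being outside any chunk
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            kind = label[2:]
            if prev not in (f"B-{kind}", f"I-{kind}"):
                bad.append(i)
        prev = label
    return bad


def repair(labels: List[str]) -> List[str]:
    """Rewrite each illegal I-X as B-X (an illustrative heuristic only)."""
    fixed = list(labels)
    for i in illegal_positions(labels):
        fixed[i] = "B-" + labels[i][2:]
    return fixed
```

Because a B-X label licenses a following I-X exactly as an I-X does, rewriting one label never makes a later label illegal, so the repaired sequence is always well formed under this scheme.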
6 Comments 
As was argued in Section 3, increasing
the size of the context produces better results, 
and such performance is bounded by issues such 
as sparse statistics. Our experiments suggest 
that this was indeed true. 
We make no claims about the generality of 
our modelling. Clearly it is specific to the tagger 
used. 
In more detail, we found that: 
• PPs seem easy to identify. 
• ADJP and ADVP chunks were hard to 
identify correctly. We suspect that im- 
provements here require greater syntactic 
information than just base-phrases. 
• Our performance on NPs could be
improved upon. In terms of modelling, we
did not treat any chunk differently from 
any other chunk. We also did not treat any 
words differently from any other words. 
• The performance using just words and just
POS tags was roughly equivalent. However,
the performance using both sources
was better than when using either source 
of information in isolation. The reason for 
this is that words and POS tags have differ- 
ent properties, and that together, the speci- 
ficity of words can overcome the coarseness 
of tags, whilst the abundance of tags can 
deal with the sparseness of words. 
Our results were not wildly worse than those
reported by Buchholz et al. (1999). This
comparable level of performance suggests that
shallow parsing (base phrasal recognition) is a
fairly easy task. Improvements might come from
better modelling, dealing with illegal chunk
sequences, allowing multiple chunks with
confidence intervals, system combination etc.,
but we feel that such improvements will be
small. Given this, we believe that base-phrasal
chunking is close to being a solved problem.

test data   precision    recall    Fβ=1
ADJP           72.42%    64.16%   68.04
ADVP           75.94%    79.10%   77.49
CONJP          50.00%    55.56%   52.63
INTJ          100.00%    50.00%   66.67
LST             0.00%     0.00%    0.00
NP             91.92%    92.45%   92.19
PP             95.95%    97.44%   96.69
PRT            73.33%    72.64%   72.99
SBAR           86.40%    80.75%   83.48
VP             92.13%    93.28%   92.70
all            91.65%    92.23%   91.94
Table 1: The results for configuration 4. Overall
accuracy: 94.88%
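For reference, the FB1 figures in the tables are the usual F measure with β = 1, that is, the harmonic mean of precision (P) and recall (R). A minimal sketch follows; the function name and the handling of the zero case are our own choices.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F measure combining precision and recall; beta = 1 weights them equally.

    Returns 0.0 when both precision and recall are zero, as in the LST rows.
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With the overall precision and recall of Table 1, `f_beta(91.65, 92.23)` rounds to 91.94, matching the reported score.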
Acknowledgements 
We would like to thank Erik Tjong Kim Sang 
for supplying the evaluation code, and Donnla 
Nic Gearailt for dictating over the telephone, 
and from the top-of-her-head, a Perl program 
to help extract wrongly labelled sentences from 
the results. 

References 
Claire Cardie and David Pierce. 1998. Error-Driven 
Pruning of Treebank Grammars for Base Noun 
Phrase Identification. In Proceedings of the 17th
International Conference on Computational Lin- 
guistics, pages 218-224. 
Lance A. Ramshaw and Mitchell P. Marcus. 
1995. Text Chunking Using Transformation-Based
Learning. In Proceedings of the 3rd ACL
Workshop on Very Large Corpora, pages 82-94, 
June. 
Adwait Ratnaparkhi. 1996. A Maximum En- 
tropy Part-Of-Speech Tagger. In Proceed- 
ings of Empirical Methods in Natural Lan- 
guage, University of Pennsylvania, May. Tagger: 
ftp://ftp.cis.upenn.edu/pub/adwait/jmx.
Sabine Buchholz, Jorn Veenstra and Walter
Daelemans. 1999. Cascaded Grammatical Relation
Assignment. In Proceedings of EMNLP/VLC-99.
Association for Computational Linguistics. 
