In: Proceedings of CoNLL-2000 and LLL-2000, pages 139-141, Lisbon, Portugal, 2000. 
Chunking with Maximum Entropy Models 
Rob Koeling 
SRI Cambridge 
koeling@cam.sri.com
1 Introduction 
In this paper I discuss a first attempt to create a 
text chunker using a Maximum Entropy model. 
The first experiments, implementing classifiers
that tag every word in a sentence with a phrase
tag using very local lexical information, part-of-speech
tags and phrase tags of surrounding
words, give encouraging results.
2 Maximum Entropy models 
Maximum Entropy (MaxEnt) models (Jaynes, 
1957) are exponential models that implement 
the intuition that if there is no evidence to 
favour one alternative solution above another, 
both alternatives should be equally likely. In
order to accomplish this, as much information
as possible about the process to be modelled is
collected. This information consists of frequencies
of events relevant to the process; these frequencies
are considered to be properties of the process.
When building a model we restrict our attention
to models with these properties. In most cases the
process is only partially described. The MaxEnt
framework then demands that, of all the models
satisfying these constraints, we choose the one with
the flattest probability distribution. This is the
model with the highest entropy (given that the
constraints are met). When we are looking for a
conditional model P(w|h), the MaxEnt solution
has the form:

    P(w|h) = (1/Z(h)) · exp(Σ_i λ_i f_i(h,w))

where f_i(h,w) is a (binary-valued) feature function
that describes a certain event, λ_i is a parameter
that indicates how important feature f_i is for the
model, and Z(h) is a normalisation factor.
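To make the exponential form concrete, the following small sketch computes P(w|h) for a set of candidate tags. It is only an illustration of the formula above: the feature functions, λ values and dictionary-based history are invented for the example and have nothing to do with the Maccent implementation.

    import math

    def maxent_prob(history, candidates, features, weights):
        # Conditional MaxEnt model: P(w|h) is proportional to
        # exp(sum_i lambda_i * f_i(h, w)).
        scores = {}
        for w in candidates:
            scores[w] = math.exp(sum(lam * f(history, w)
                                     for lam, f in zip(weights, features)))
        z = sum(scores.values())  # normalisation factor Z(h)
        return {w: s / z for w, s in scores.items()}

    # Toy example with two hypothetical binary features over chunk tags.
    features = [lambda h, w: int(w == "I-NP" and h["pos0"] == "NN"),
                lambda h, w: int(w == "B-NP" and h["pos0"] == "DT")]
    weights = [1.2, 0.8]  # hypothetical lambda values
    print(maxent_prob({"pos0": "NN"}, ["B-NP", "I-NP", "O"], features, weights))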
In the last few years there has been increasing
interest in applying MaxEnt models to NLP tasks
(Ratnaparkhi, 1998; Berger et al., 1996;
Rosenfeld, 1994; Ristad, 1998). The
attraction of the framework lies in the ease with 
which different information sources used in the 
modelling process are combined and the good 
results that are reported with the use of these 
models. Another strong point of this framework 
is the fact that general software can easily be 
applied to a wide range of problems. For these 
experiments we have used off-the-shelf software 
(Maccent) (Dehaspe, 1997). 
3 A MaxEnt chunker
3.1 Attributes used 
First we need to decide which information
sources might help to predict the chunk tag. We
have to work with the information that is included
in the WSJ corpus, so the choice is initially
limited to:
• Current word 
• POS tag of current word 
• Surrounding words 
• POS tags of surrounding words 
All these sources will be used, but for the sources
based on surrounding words we have to decide how
much context is taken into account. I did not
perform exhaustive tests to find the best
configuration, but following (Tjong Kim Sang and
Veenstra, 1999; Ratnaparkhi, 1997) I only used very
local context. In these experiments I used a left
context of three words and a right context of two
words. Experiments described in (Muñoz et al., 1999)
successfully used larger contexts, but the few tests
that I performed to check this gave no evidence that
we could benefit significantly from extending the
context. Apart from the information
given by the WSJ corpus, information generated 
by the model itself will also be used: 
• Chunk tags of previous words 
It would of course sometimes be desirable to use
the chunk tags of the following words as well, but
these are not instantly available, so that would
require a cascaded approach. I have experimented
with a cascaded chunker, but it did not improve
the results significantly.
In order to use previously predicted chunk 
tags, the evaluation part of the Maccent soft- 
ware had to be modified. The evaluation pro- 
gram needs a previously created file with all 
the attributes and the actual class, but the 
chunk tags of the previous two words cannot
be provided beforehand, as they are produced
during evaluation. A cascaded approach, in which
the predicted tags are added to the test-data file
after a first run, is also not completely
satisfactory, as those tags are then predicted on
the basis of all the other attributes but not the
previous chunk tags. Ideally the information about
the tags of the previous words is added during
evaluation itself. This required some modification
of the evaluation script.
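To illustrate how such an attribute vector might be assembled, the sketch below collects, for every word, the local context described above (a left context of three words, a right context of two, and the chunk tags already predicted for the previous words) and then tags a sentence left to right. The attribute names and the classify callback are assumptions made for the example; this is not the Maccent interface.

    def attributes(words, pos_tags, chunk_tags, i):
        # Atomic attributes for position i; "<s>" marks positions that fall
        # outside the sentence (or chunk tags that are not yet predicted).
        def w(j):
            return words[j] if 0 <= j < len(words) else "<s>"
        def pos(j):
            return pos_tags[j] if 0 <= j < len(pos_tags) else "<s>"
        def tag(j):
            return chunk_tags[j] if 0 <= j < i else "<s>"
        return {
            "W-1": w(i - 1), "W0": w(i), "W+1": w(i + 1),
            "POS-3": pos(i - 3), "POS-2": pos(i - 2), "POS-1": pos(i - 1),
            "POS0": pos(i), "POS+1": pos(i + 1), "POS+2": pos(i + 2),
            "TAG-3": tag(i - 3), "TAG-2": tag(i - 2), "TAG-1": tag(i - 1),
        }

    def chunk_sentence(words, pos_tags, classify):
        # Tag the sentence left to right; chunk tags predicted so far are
        # fed back in as attributes. `classify` stands in for the trained model.
        chunk_tags = []
        for i in range(len(words)):
            chunk_tags.append(classify(attributes(words, pos_tags, chunk_tags, i)))
        return chunk_tags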
3.2 Results 
The experiments are evaluated using the following
standard evaluation measures (a small code sketch
of the chunk-level measures follows the list):

• Tagging accuracy =
    number of correctly tagged words / total number of words

• Recall =
    number of correctly proposed baseNPs / number of baseNPs in the data

• Precision =
    number of correctly proposed baseNPs / number of proposed baseNPs

• F_β-score =
    (β² + 1) · Recall · Precision / (β² · Recall + Precision)

(β = 1 in all experiments.)
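The chunk-level measures can be sketched as follows, assuming the baseNPs are given as sets of spans; the (sentence_id, start, end) encoding is an assumption made for the example, not something prescribed by the data format.

    def chunk_scores(gold, proposed, beta=1.0):
        # Chunk-level precision, recall and F-beta over baseNP spans,
        # each span encoded e.g. as a (sentence_id, start, end) tuple.
        correct = len(gold & proposed)
        precision = correct / len(proposed) if proposed else 0.0
        recall = correct / len(gold) if gold else 0.0
        if precision == 0.0 and recall == 0.0:
            return precision, recall, 0.0
        f = (beta ** 2 + 1) * recall * precision / (beta ** 2 * recall + precision)
        return precision, recall, f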
In all the experiments a left context of three words
and a right context of two words was used. The
part-of-speech tags of the surrounding words and of
the word itself were all used as atomic features.
The lexical information used consisted of the
previous word, the current word and the next word.
The word W_-2 was omitted because it did not seem
to improve the model. Using only these atomic
features, the model scored a tagging accuracy of
about 95.5% and an F-score of about 90.5%, well
below the results reported in the literature.
Adding features that combine POS tags improved the
results significantly, to just below state-of-the-art
scores. Finally, two complex features involving NP
chunk tags predicted for previous words were added.
The most successful set of features used in our
experiments is given in figure 1.
Template                        Meaning
TAG_-3                          Base-NP tag of W_-3
POS_-3                          Part-of-speech tag of W_-3
POS_-3/POS_0
POS_-3/POS_-2
TAG_-3/TAG_-2/TAG_-1/POS_0
TAG_-2
POS_-2
W_-1                            Previous word
TAG_-1
POS_-1
W_0                             Current word
POS_0
W_+1
POS_+1
POS_-2/POS_0
POS_-2/POS_-1
TAG_-2/TAG_-1/POS_0
POS_-1/TAG_-1/POS_0
POS_-1/POS_0/POS_+1
POS_-2/POS_-1/POS_0/POS_+1
POS_0/POS_+1
POS_+2
POS_0/POS_+2
POS_0/POS_+1/POS_+2

Figure 1: Feature set-up for the best scoring experiment
It is not claimed that this is the best set of
features possible for this task. Trying new feature
combinations by adding them manually and testing the
new configuration is a time-consuming and not very
interesting activity, especially since, when the
scores are close to the best published scores,
adding new features has little impact on the
behaviour of the model. An algorithm that discovers
the interaction between features and suggests which
features could be combined to improve the model
would be very helpful here. I did not include any
complex features involving lexical information. It might be
useful to include more features with lexical
information if more training data is available
(for example the full R&M data set, consisting of
sections 2-21 of the WSJ).
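For illustration, the sketch below shows one possible way of forming such complex features by conjoining atomic attribute values. The template list only mirrors a few entries of figure 1 and reuses the attribute dictionary from the earlier sketch; it is not a reimplementation of the actual experiment.

    # A few of the conjoined templates of figure 1; each entry names the
    # atomic attributes whose values are glued into one complex feature.
    TEMPLATES = [
        ("POS-1", "POS0", "POS+1"),
        ("POS-2", "POS-1", "POS0", "POS+1"),
        ("TAG-2", "TAG-1", "POS0"),
        ("TAG-3", "TAG-2", "TAG-1", "POS0"),
    ]

    def complex_features(atomic):
        # Conjoin the values of the atomic attributes named in each template
        # into a single feature string, e.g. "POS-1/POS0/POS+1=DT/NN/VBZ".
        feats = []
        for template in TEMPLATES:
            name = "/".join(template)
            value = "/".join(atomic[a] for a in template)
            feats.append(name + "=" + value)
        return feats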
For feature selection a simple count cut-off was
used. I experimented with several combinations of
thresholds and numbers of iterations used to train
the model. When the threshold was set to 2, unique
contexts (which can be problematic during training
of the model; see (Ratnaparkhi, 1998)) did not occur
very frequently anymore, and an upper bound on the
number of iterations did not seem to be necessary.
It was found that (using a threshold of 2 for every
single feature) after about 100 iterations the model
did not improve very much anymore. Using the feature
set-up given in figure 1, a threshold of 2 for all
the features, and allowing the model to train for
100 iterations, the scores given in table 1 were
obtained.
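A minimal sketch of such a count cut-off, assuming the training data has already been turned into one list of feature strings per word, could look as follows; the threshold of 2 matches the setting described above.

    from collections import Counter

    def cutoff_features(events, threshold=2):
        # Simple count cut-off: keep only the features (attribute=value
        # pairs, atomic or conjoined) that occur at least `threshold`
        # times in the training events.
        counts = Counter()
        for feats in events:
            counts.update(feats)
        return {f for f, c in counts.items() if c >= threshold}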
4 Concluding remarks 
The first observation that I would like to make
here is that it was relatively easy to get results
that are comparable with previously published
results. Even though some improvement is to be
expected when more detailed features, more context
and/or more training data are used, it seems
necessary to incorporate other sources of
information to improve significantly on these
results.
Further, it is not satisfactory to have to find out
which attribute combinations to use by trying new
combinations and testing them. It might be
worthwhile to examine ways to automatically detect
which feature combinations are promising (Mikheev,
forthcoming; Della Pietra et al., 1997).

References 
Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).
Luc Dehaspe. 1997. Maximum entropy modeling with clausal constraints. In Proceedings of the 7th International Workshop on Inductive Logic Programming.
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).
E. T. Jaynes. 1957. Information theory and statistical mechanics. Physical Review, 108:171-190.
Andrei Mikheev. forthcoming. Feature lattices and maximum entropy models. Journal of Machine Learning.
Marcia Muñoz, Vasin Punyakanok, Dan Roth, and Dav Zimak. 1999. A learning approach to shallow parsing. In Proceedings of EMNLP/VLC-99.
Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Brown University, Providence, Rhode Island.
Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.
Sven Eric Ristad. 1998. Maximum entropy modelling toolkit. Technical report.
Ronald Rosenfeld. 1994. Adaptive Statistical Language Modelling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University.
Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Strategy and tactics: a model for ... production. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, University of Bergen, Bergen, Norway.
