Feature Lattices for Maximum Entropy Modelling 
Andrei Mikheev* 
HCRC, Language Technology Group, University of Edinburgh, 
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, UK. 
e-mail: Andrei. Mikheev@ed.ac.uk 
Abstract 
Maximum entropy framework proved to be ex- 
pressive and powerful for the statistical lan- 
guage modelling, but it suffers from the com- 
putational expensiveness of the model build- 
ing. The iterative scaling algorithm that is used 
for the parameter estimation is computation- 
ally expensive while the feature selection pro- 
cess might require to estimate parameters for 
many candidate features many times. In this 
paper we present a novel approach for building 
maximum entropy models. Our approach uses 
the feature collocation lattice and builds com- 
plex candidate features without resorting to it- 
erative scaling. 
1 Introduction 
Maximum entropy modelling has been recently 
introduced to the NLP community and proved 
to be an expressive and powerful framework. 
The maximum entropy model is a model which 
fits a set of pre-defined constraints and assumes 
maximum ignorance about everything which is 
not subject to its constraints thus assigning such 
cases with the most uniform distribution. The 
most uniform distribution will have the entropy 
on its maximum 
Because of its ability to handle overlapping 
features the maximum entropy framework pro- 
vides a principle way to incorporate informa- 
tion from multiple knowledge sources. It is 
superior totraditionally used for this purpose 
linear interpolation and Katz back-off method. 
(Rosenfeld, 1996) evaluates in detail a maxi- 
mum entropy language model which combines 
unigrams, bigrams, trigrams and long-distance 
trigger words, and provides a thorough analysis 
of all the merits of the approach. 
* Now at Harlequin Ltd. 
The iterative scaling algorithm 
(Darroch&Ratcliff, 1972) applied for the pa- 
rameter estimation of maximum entropy mod- 
els computes a set of feature weights (As) which 
ensure that the model fits the reference distri- 
bution and does not make spurious assumptions 
(as required by the maximum entropy principle) 
about events beyond the reference distribution. 
It, however, does not guarantee that the fea- 
tures employed by the model are good features 
and the model is useful. Thus the most im- 
portant part of the model building is the fea- 
ture selection procedure. The key idea of the 
feature selection is that if we notice an interac- 
tion between certain features we should build a 
more complex feature which will account for this 
interaction. The newly added feature should 
improve the model: its Kullback-Leibler diver- 
gence from the reference distribution should de- 
crease and the conditional maximum entropy 
model will also have the greatest log-likelihood 
(L) value: 
The basic feature induction algorithm pre- 
sented in (Della Pietra et al., 1995) starts with 
an empty feature space and iterativety tries all 
possible feature candidates. These candidates 
are either atomic features or complex features 
produced as a combination of an atomic feature 
with the features already selected to the model's 
feature space. For every feature from the can- 
didate feature set the algorithm prescribes to 
compute the maximum entropy model using the 
iterative scaling algorithm described above, and 
select the feature which in the largest way min- 
imizes the Kullback-Leibler divergence or max- 
imizes the log-likelihood of the model. This ap- 
proach, however, is not computationally feasi- 
ble since the iterative scaling is computation- 
ally expensive and to compute models for many 
candidate features many times is unreal. To 
848 
make feature ranking computationally tractable 
in (Della Pietra et al., 1995) and (Berger et al., 
1996) a simplified process proposed: at the fea- 
ture ranking stage when adding a new feature 
to the model, all previously computed parame- 
ters are kept fixed and, thus, we have to fit only 
one new constraint imposed by the candidate 
feature. Then after the best ranked feature has 
been established, it is added to the feature space 
and the weights for all the features are recom- 
puted. This approach estimates good features 
relatively fast but it does not guarantee that at 
every single point we add the best feature be- 
cause when we add a new feature to the model 
all its parameters can change. 
In this paper we present a novel approach to 
feature selection for the maximum entropy mod- 
els. Our approach uses a feature collocation lat- 
tice and selects candidate features without re- 
sorting to the iterative scaling. 
2 Feature Collocation Lattice 
We start the modelling process by building a 
sample space w to train our model on. The sam- 
ple space consists of observed events of interest 
mapped to a set of atomic features T which we 
should define beforehand. Thus every observa- 
tion from the sample space is a binary vector 
of atomic features: if an observation includes 
a certain feature, its corresponding bit in the 
vector is turned on (set to 1) otherwise it is 0. 
When we have a set of atomic features T and 
a training sample of configurations w, we can 
build the feature collocation lattice. Such collo- 
cation lattice will represent, in fact, the factorial 
constraint space (X) for the maximum entropy 
model and at the same time will contain all seen 
and logically implied configurations (w+). For- 
mally, the feature collocation lattice is a 3-ple: 
(0, C_, ~w) where 
0 is a set of nodes of the lattice which corre- 
sponds to the union of the feature space of 
the maximum entropy model and the con- 
figuration space: 0 = XU~(w). In fact, the 
nodes in the lattice (0) can have dual in- 
terpretation - on one hand they can act as 
mapped configurations from the extended 
configuration space (w +) and on the other 
hand they can act as features from the con- 
straint space (X); 
• C_ is a transitive, antisymmetric relation 
over 0 x 0 - a partial ordering. We also will 
need the indicator function to flag whether 
the relation C holds from node i to node k: 
1 il OiC_Ok 
foi(Ok) = 0 otherwise 
~w is a set of configuration frequency counts 
of the nodes (0) of the lattice. This repre- 
sents how many times we saw a particu- 
lar configuration in our training samples. 
Because of the dual interpretation of the 
nodes, a node can also be associated with 
its feature frequency count i.e. the num- 
ber of times we see this feature combina- 
tion anywhere in the lattice. The feature 
frequency of a node will then be ~X(0k) = 
~0~e0 f0k(0i) * ~0~ which is the sum of all 
the configuration frequency counts (~w) of 
the descendant nodes. 
Suppose we have a lattice of nodes 
A,B,\[AB\] with obvious relations: A C_ 
\[AB\]; B C_ \[AB\]: 
A "~ \[AB\] ~, B 
The configuration frequency ~,~ will be the 
number of times we saw A but not \[AB\] 
and then the feature frequency of A will 
be: ~ = ~ + ~,~B i.e. the number of times 
we saw A in all the nodes. 
When we construct the feature collocation 
lattice from a set of samples, each sample repre- 
sents a feature configuration which we must add 
to the lattice as its node (Ok). To support gener- 
alizations over the domain we also want to add 
to the lattice the nodes which are shared parts 
with other nodes in the lattice. Thus we add 
to the lattice all sub-configurations of a newly 
added configuration which are the intersections 
with the other nodes. We increment the con- 
figuration frequency (~) of a node each time 
we see in the training samples this particular 
configuration in full. For example, if a config- 
uration \[ABCD\] comes from a training sam- 
ple and it is still not in the lattice, we create 
a node \[ABCD\] and set its configuration fre- 
quency ~\[~ABCD\] to 1. If by that time there is a 
node \[ABDE\] in the lattice, we then also create 
849 
the node \[ABD\], relate it to the nodes \[ABCD\] 
and \[ABDE\] and set its configuration frequency 
to 0. If \[ABCD\] had already existed in the lat- 
tice, we would simply incremented its configu- 
ration frequency: ~\[WABCD \] ~- ~\[WABcD \] + 1. 
Thus in the feature lattice we have nodes with 
non-zero configuration frequencies, which we 
call reference nodes and nodes with zero config- 
uration frequencies which we call latent or hid- 
den nodes. Reference nodes actually represent 
the observed configuration space (w). Hidden 
nodes are never observed on their own but only 
as parts of the reference nodes and represent 
possible generalizations about domain: low- 
complexity constraints (X) and logically possi- 
ble configurations (w+). 
This method of building the feature colloca- 
tion lattice ensures that along with true obser- 
vations it contains hidden nodes which can pro- 
vide generalizations about the domain. At the 
same time there is no over-generation of the hid- 
den nodes: no logically impossible feature com- 
binations and no hidden nodes without general- 
ization power are included. 
3 Feature Selection 
After we constructed from a set of samples the 
feature collocation lattice (0, C_,(~), which we 
will call the empirical lattice, we try to esti- 
mate which features contribute and which do 
not to the frequency distribution on the refer- 
ence nodes. Thus only the predictive features 
will be retained in the lattice. The optimized 
feature space can be seen as a feature lattice de- 
fined over the empirical feature lattice: 0' C_ 0 
and initially it is empty: 0' =j0. We build the 
optimized lattice by incrementally adding a fea- 
ture (atomic or complex) from the empirical lat- 
tice, together with the nodes which are the min- 
imal collocations of this feature with the nodes 
already included into the optimized lattice. The 
necessity to add the collocations comes from the 
fact that the features (or nodes) can overlap 
with each other and we want to have a unique 
node for such overlap. So if in the optimized 
feature lattice there is just one feature A, then 
when we add the feature B we also have to add 
the collocation \[AB\] if it exists in the empirical 
lattice. The configuration frequency of a node 
in the optimized lattice ((,w) then can be corn- 
puted as: 
tW__ 
(1) 
Thus a node in the optimized lattice takes all 
configuration frequencies ((w) of itself and the 
above related nodes if these nodes do not belong 
to the optimized lattice themselves and there is 
no higher node in the optimized lattice related 
to them. 
Figure i shows how the configuration frequen- 
cies in the optimized lattice are redistributed 
when adding a new feature. First the lat- 
tice is empty. When we add the feature A 
to the optimized lattice (Figure 1.a), because 
no other features are present in the optimized 
lattice, it takes all the configuration frequen- 
cies of the nodes where we see the feature A: 
~ = ~ + ~B + ~C + ~BC" Case b) of Fig- 
ure 1 represents the situation when we add the 
feature B to the optimized lattice which already 
includes the feature A. Apart from the node B 
we also add the collocation of the nodes A and 
B to the optimized lattice. Now we have to re- 
distribute the configuration frequencies in the 
optimized lattice. The configuration frequency 
of the node A now will become the number of 
times of seeing the feature A but not the fea- 
ture combination AB: ('A w = ~1 + ~C" The 
configuration frequency of the node B will be 
the number of times of seeing the node B but 
not the nodeAB: ~ = ~ + w (BC" The con- 
figuration frequency of the node AB will be: 
~B = %ABCW -b %ABC'¢W When we add the feature C 
to the optimized lattice (Figure 1.c) we produce 
a fully saturated lattice identical to the empiri- 
cal lattice, since the node C will collocate with 
the node A producing AC and will collocate 
with the node B producing BC. These nodes 
in their turn will collocate with each other and 
with the node AB producing the node ABC. 
During the optimized lattice construction all 
the features (atomic and complex) from the em- 
pirical lattice compete, and we include the one 
which results in a optimized lattice with the 
smallest divergence D(p \[I P') and equation ??) 
and therefore with the greatest log-likelihood 
Lp(p') , where: 
• p(Oi) is the probability for the i-th node in 
850 
a) 
~'~' = ¢~ + ,'.~ + ¢~.c + ~'~.~c 
b) c) 
~ = ¢~ + ~\]c ~ = ~\] ~ = ~ = +~ ~= ,~=~ 
(~c = ~c ~T~c 
= ~c 
Figure 1: This figure shows the redistribution of the configuration frequencies in the optimized feature lattice when 
adding new nodes. Case a) stands for adding the feature A to the empty lattice, case b) stands for adding the feature 
B to the lattice with the feature A and case c) stands for adding the feature C to the lattice with the atomic features 
A and B and their collocations. The unfilled nodes stand for the nodes in the empirical lattice which don't have 
reference in the optimized lattice. The nodes in bold stand for the nodes decided by the optimized lattice (i.e. they 
can be assigned with non-default probabilities). 
the empirical lattice: 
p(Oi)=--~°~ where N= ~_, ~ (2) N 
o~eo 
• p'(Si) is the probability assigned to the i-th 
node using only the nodes included into the 
optimized lattice. 
but there is just one undecided node (C) which 
is not shown in bold. So the probabilities for 
the nodes will be: 
I%~ %¢t.u./ ¢iw 
=/(A) /(ec) = p'(B) 
/(ABC) =/(AB) /(C) -- ~ 
N is the total count on the empirical lattice and 
{ ~_~_. . -, is calculated as shown in equation 2: 
~f u~Eu 
p'(Oi) = ~ '/ O, ¢ O' & N = ~.~ + ~ + ~ + ~B + ~'~C + ~'~BC" 
\[30~: oj e 0' & 0r c o,\] & \[~0k : 0k e 0' & "e~ c 0~ & 0j c 8k\] The presented above method provides us with 
1/\]Y \] oth~,.,,ise - an efficient way of selecting only important fea- 
(3) 
The optimized lattice assigns the probabil- 
ity to a node in the empirical lattice equal 
to that of its most specific sub-node from 
the optimized lattice. For reference nodes 
which do not have sub-nodes in the opti- 
mized lattice at all (undecided nodes) ac- 
cording to the maximum entropy principle 
we assign the uniform probability of mak- 
ing an arbitrary prediction. 
For instance, for the example on Figure 1.b 
the optimized lattice includes only three nodes 
tures from the initial set of candidate features 
without resorting to iterative scaling. When 
this way we add the features to the optimized 
lattice some candidate features might not suf- 
ficiently contribute to the probability distribu- 
tion on the lattice. For instance, in the example 
presented on Figure 1, after we added the fea- 
ture \[B\] (case b) the only remaining undecided 
node was IV\]. If the node \[C\] is truly hidden (i.e. 
it does not have its own observation frequency) 
and all other nodes are optimally decided, there 
is no point to add the node \[C\] into the lattice 
and instead of having 9 nodes we will have only 
851 
3. Another consideration which we apply during 
the lattice building is to penalize the develop- 
ment of low frequency (but not zero frequency) 
nodes i.e. the nodes with no reliable statistics 
on them. Thus we smooth the estimates on such 
nodes with the uniform distribution (which has 
the entropy on its maximum): 
p"(Oi) = L * ~ + (1 - L) * p'(Oi) where L = 
THRESHOLD THRESHOLD+~'O~ 
So for high frequency nodes this smoothing is 
very minor but for nodes with frequencies less 
than two thresholds the penalty will be consid- 
erable. This will favor nodes which do not cre- 
ate sparce collocations with other nodes. 
The described method is similar in spirit to 
the method of word trigger incorporation to a 
trigram model suggested in (Rosenfeld, 1996): 
if a trigram predicts well enough there is no 
need for an additional trigger. The main differ- 
ence is that we do not recompute the maximum 
entropy model every time but use our own fre- 
quency redistribution method over the colloca- 
tion lattice. This is the crucial difference which 
makes a tremendous saving in time. We also 
do not require a newly added feature to be ei- 
ther atomic or a collocation of an atomic feature 
with a feature already included into the model 
as it was proposed in (Della Pietra et al., 1995) 
(Berger et al., 1996). All the features are cre- 
ated equal and the model should decide on the 
level of granularity by itself. 
4 Model Generalization 
After we have chosen a subset of features for our 
model, we restrict our feature lattice to the op- 
timized lattice. Now we can compute the max- 
imum entropy model taking the reference prob- 
abilities (which are configuration probabilities) 
as in equation 3. 
The nodes from the optimized lattice serve 
both as possible domain configurations and as 
potential constraint features to our model. We, 
however, want to constrain only the nodes with 
the reliable statistics on them in order not to 
overfit the model. This in its turn will take off 
certain computational load, since we expect a 
considerable number of fragmented (simply in- 
frequent) nodes in the optimized lattice. This 
comes from the requirement to build all the col- 
locations when we add a new node. Although 
many top-level nodes will not be constrained, 
the information from such infrequent nodes will 
not be lost completely - it will contribute to 
more general nodes since for every constrained 
node we marginalize Over all its unconstrained 
descendants (more specific nodes). Thus as 
possible constraints for the model we will con- 
sider only those nodes from the optimized lat- 
tice, whose marginalized over responses feature 
frequency counts I are greater than a certain 
threshold, e.g.: ~0x__ ~ y > 5. This considera- =(, ) 
tion is slightly different from the one suggested 
in (Ristad, 1996) where it was proposed to un- 
constrain nodes with infrequent joint feature 
frequency counts. Thus if we saw a certain fea- 
ture configuration say 5,000 times and it always 
gave a single response we suggest to constrain 
as well the observation that we never saw this 
configuration with the other responses. If we 
applied the suggestion of (Ristad, 1996) and 
cut out on the basis of the joint frequency we 
would lose the negative evidence, which is quite 
reliable judging by the total frequency of the 
observation. 
Initially we constrain all the nodes which sat- 
isfy the above requirement. In order to gener- 
alize and simplify our maximum entropy model, 
we uncgnstrain the most specific features, com- 
pute a new simplified maximum entropy model, 
and if it still predicts well, we repeat the pro- 
cess. So our aim is to remove from the con- 
straints as many top level nodes as possible 
without losing the model fitness to the refer- 
ence distribution (15) of the optimized feature 
lattice. The necessary condition for a node to 
be taken as a candidate to unconstrain, is that 
this node shouldn't have any constrained nodes 
above it. There is also a natural ranking for 
the candidate nodes: the closer to 1 the weight 
(),) of a such a node is, the less it is important 
for the model. We can set a certain thresh- 
old on the weights, so all the candidate nodes 
whose As differ from 1 less than this threshold 
will be unconstrained in one go. Therefore we 
don't have to use the iterative scaling for feature 
ranking and apply it only for model recompu- 
tation, possibly un-constraining several feature 
configurations (nodes) at once. This method, in 
fact, resembles the Backward Sequential Search 
1~'~(Ok) = ~o,~o, Io, (o~) • gg 
852 
(BSS) proposed in (Pedersen&Bruce, 1997) for 
decomposable models. There is also a sig- 
nificant reduction in computational load since 
the generalized smaller model deviates from the 
previous larger model only in a small number of 
constraints. So we use the parameters of that 
larger model 2 as the initial values for the itera- 
tive scaling algorithm. This proved to decrease 
the number of required iterations by about ten- 
fold, which makes a tremendous saving in time. 
There can be many possible criteria when to 
stop the generalization algorithm. The sim- 
plest one is just to set a predefined threshold 
on the deviation D(fi II P) of the generalized 
model from the reference distribution. (Peder- 
sen&Bruce, 1997) suggest to use Akaike's Infor- 
mation Criteria (AIC) to judge the acceptabil- 
ity of a new model. AIC rewards good model fit 
and penalizes models with high complexity mea- 
sured in the number of features. We adopted 
the stop condition suggested in (Berger et al., 
1996) - the maximization of the likelihood on a 
cross-validation set of samples which is unseen 
at the parameter estimation. 
5 Application: Fullstop Problem 
Sentence boundary disambiguation has recently 
gained certain attention of the language engi- 
neering community. It is required for most text 
processing tasks such as, tagging, parsing, par- 
allel corpora alignment etc., and, as it turned 
out to be, this is a non-trivial task itself. A 
period can act as the end of a sentence or be 
a part of an abbreviation, but when an abbre- 
viation is the last word in a sentence, the pe- 
riod denotes the end of a sentence as well. The 
simplest "period-space-capital_letter" approach 
works well for simple texts but is rather unre- 
liable for texts with many proper names and 
abbreviations at the end of sentence as, for in- 
stance, the Wall Street Journal (WSJ) corpus ( 
(Marcus et al., 1993) ). 
One well-known trainable systems - SATZ 
- is described in (Palmer&Hearst, 1997). It 
uses a neural network with two layers of hid- 
den units. It was trained on the most prob- 
able parts-of-speech of three words before and 
three words after the period using 573 samples 
from the WSJ corpus. It was then tested on 
2instead of the uniform distribution as prescribed in 
the step 1 of the Improved Iterative Scaling algorithm. 
853 
unseen 27,294 sentences from the same corpus 
and achieved 1.5% error rate. Another auto- 
matically trainable system described in (Rey- 
nar&Ratnaparkhi, 1997). This system is sim- 
ilar to ours in the model choice - it uses the 
maximum entropy framework. It was trained 
on two different feature sets and scored 1.2% 
error rate on the corpus tuned feature set and 
2% error rate on a more portable feature set. 
The features themselves were words and their 
classes in the immediate context of the period 
mark. (Reynar&Ratnaparkhi, 1997) don't re- 
port on the number of features utilized by their 
model and don't describe their approach to fea- 
ture selection but judging by the time their sys- 
tem was trained (18 minutes 3) it did not aim 
to produce the best performing feature-set but 
estimated a given one. 
To tackle this problem we applied our method 
to a maximum entropy model which used a 
lexicon of words associated with one or more 
categories from the set: abbreviation, proper 
noun, content word, closed-class word. This 
model employed atomic features such as the lex- 
icon information for the words before and after 
the period, their capitalization and spellings. 
For training we collected from the WSJ cor- 
pus 51,000 samples of the form (Y, F..F) and (N, F..F), 
where Y stands for the end of sen- 
tence, N stands for otherwise and Fs stand for 
the atomic features of the model. We started to 
built the model with 238 most frequent atomic 
features which gave us the collocation lattice of 
8,245 nodes in 8 minutes of processor time on 
five SUN Ultra-1 workstations working in par- 
allel by means of multi-threading and Remote 
Process Communication. When we applied the 
feature selection algorithm (section 3), we in 53 
minutes boiled the lattice down to 769 nodes. 
Then constraining all the nodes, we compiled 
a maximum entropy model in about 15 minutes 
and then using the constraint removal process in 
two hours we boiled the constraint space down 
to 283. In this set only 31 atomic features re- 
mained. This model was detected to achieve the 
best performance on a specified cross-validation 
set. For the evaluation we used the same 27,294 
sentences as in (Palmer&Hearst, 1997) 4 which 
aPersonal communication 
4We would like to thank David Palmer for making his 
test data available to us. 
were also used by (Reynar&Ratnaparkhi, 1997) 
in the evaluation of their system. These sen- 
tences, of course, were not seen at the train- 
ing phase of our model. Our model achieved 
99,2477% accuracy which is the highest quoted 
score on this test-set known to the authors. 
We attribute this to the fact that although we 
started with roughly the same atomic features 
as (Reynar&Ratnaparkhi, 1997) our system 
created complex features with higher prediction 
power. 
6 Conclusion 
In this paper we presented a novel approach for 
building maximum entropy models. Our ap- 
proach uses a feature collocation lattice and se- 
lects the candidate features without resorting 
to iterative scaling. Instead we use our own 
frequency redistribution algorithm. After the 
candidate features have been selected we, us- 
ing the iterative scaling, compute a fully satu- 
rated model for the maximal constraint space 
and then apply relaxation to the most specific 
constraints. 
We applied the described method to sev- 
eral language modelling tasks such as sentence 
boundary disambiguation, part-of-speech tag- 
ging, stress prediction in continues speech gen- 
eration, etc., and proved its feasibility for select- 
ing and building the models with the complex- 
ity of tens of thousands constraints. We see the 
major achievement of our method in building 
compact models with only a fraction of possi- 
ble features (usually there is a few hundred fea- 
tures) and at the same time performing at least 
as good as state-of-the-art: in fact, our sen- 
tence boundary disambiguater scored the high- 
est known to the author accuracy (99.2477%) 
and our part-of-speech tagging model general- 
ized for a new domain with only a tiny degra- 
dation in performance. 
A potential drawback of our approach is that 
we require to build the feature collocation lat- 
tice for the whole observed feature-space which 
might not be feasible for applications with hun- 
dreds of thousands of features. So one of the 
directions in our future work is to find effi- 
cient ways for a decomposition of the feature 
lattice into non-overlapping sub-lattices which 
then can be handled by our method. Another 
avenue for further improvement is to introduce 
854 
the "or" operation on the nodes of the lattice. 
This can provide a further generalization over 
the features employed by the model. 
7 Acknowledgements 
The work reported in this paper was supported 
in part by grant GR/L21952 (Text Tokenisa- 
tion Tool) from the Engineering and Physical 
Sciences Research Council, UK. We would also 
like to acknowledge that this work was based on 
a long-standing collaborative relationship with 
Steve Finch. 

References 
A. Berger, S. Della Pietra, V. Della Pietra, 
1996. A Maximum Entropy Approach to Nat- 
ural Language Processing In Computational 
Linguistics vol.22(1) 
J.N. Darroch and D. Ratcliff 1972. Generalized 
Iterative Scaling for Log-Linear Models. The 
Annals of Mathematical Statistics, 43(5). 
S. Della Pietra, V.. Della Pietra, and J. Lafferty 
1995. Inducing Features of Random Fields 
Technical report CMU-CS-95-144 
M. Marcus, M.A. Marcinkiewicz, and B. San- 
torini 1993. Building a Large Annotated Cor- 
pus of English: The Penn Treebank. In Com- 
putational Linguistics, vol 19(2), ACL. 
D. D. Palmer and M. A. Hearst 1997. Adaptive 
Multilingual Sentence Boundary Disambigua- 
tion. In Computational Linguistics, vol 23(2), 
ACL. pp. 241-269 
T. Pedersen and R. Bruce 1997. A New Su- 
pervised Learning Algorithm for Word Sense 
Disambiguation. In Proceedings of the Four- 
teenth National Conference on Artificial In- 
telligence, Providence, RI. 
J. C. Reynar and A. Ratnaparkhi 1997. A 
Maximum Entropy Approach to Identifying 
Sentence Boundaries. In Proceedings of the 
Fifth A CL Conference on Applied Natural 
Language Processing (ANLP'97), Washing- 
ton D.C., ACL. 
E. S. Ristad 1996. Maximum Entropy Mod- 
elling Toolkit. Documentation for Version 
1.3 Beta, Draft, 
R. Rosenfeld 1996. A Maximum Entropy 
Approach to Adaptive Statistical Language 
Learning. In Computer Speech and Language, 
vol.10(3), Academic Press Limited, pp. 197-228 
