Maximum Entropy Tagging with Binary and Real-Valued Features
Vanessa Sandrini Marcello Federico Mauro Cettolo
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
38050 Povo (Trento) - ITALY
{surname}@itc.it
Abstract
Recent literature on text-tagging reported
successful results by applying Maximum
Entropy (ME) models. In general, ME
taggers rely on carefully selected binary
features, which try to capture discrimi-
nant information from the training data.
This paper introduces a standard setting
of binary features, inspired by the litera-
ture on named-entity recognition and text
chunking, and derives corresponding real-
valued features based on smoothed log-
probabilities. The resulting ME models
have orders of magnitude fewer parame-
ters. Effective use of training data to esti-
mate features and parameters is achieved
by integrating a leaving-one-out method
into the standard ME training algorithm.
Experimental results on two tagging tasks
show statistically significant performance
gains after augmenting standard binary-
feature models with real-valued features.
1 Introduction
The Maximum Entropy (ME) statistical frame-
work (Darroch and Ratcliff, 1972; Berger et al.,
1996) has been successfully deployed in several
NLP tasks. In recent evaluation campaigns, e.g.
DARPA IE and CoNLL 2000-2003, ME models
reached state-of-the-art performance on a range of
text-tagging tasks.
With few exceptions, best ME taggers rely on
carefully designed sets of features. Features cor-
respond to binary functions, which model events,
observed in the (annotated) training data and sup-
posed to be meaningful or discriminative for the
task at hand. Hence, ME models result in a log-
linearcombination ofalarge setoffeatures, whose
weights can be estimated by the well known Gen-
eralized Iterative Scaling (GIS) algorithm by Dar-
roch and Ratcliff (1972).
Despite ME theory and its related training algo-
rithm (Darroch and Ratcliff, 1972) do not set re-
strictions on the range of feature functions1, pop-
ular NLP text books (Manning and Schutze, 1999)
and research papers (Berger et al., 1996) seem
to limit them to binary features. In fact, only
recently, log-probability features have been de-
ployed in ME models for statistical machine trans-
lation (Och and Ney, 2002).
This paper focuses on ME models for two text-
tagging tasks: Named Entity Recognition (NER)
and Text Chuncking (TC). By taking inspiration
from the literature (Bender et al., 2003; Borth-
wick, 1999; Koeling, 2000), a set of standard bi-
nary features is introduced. Hence, for each fea-
ture type, a corresponding real-valued feature is
developed in terms of smoothed probability distri-
butions estimated on the training data. A direct
comparison of ME models based on binary, real-
valued, and mixed features is presented. Besides,
performance on the tagging tasks, complexity and
training time by each model are reported. ME es-
timation with real-valued features is accomplished
by combining GIS with the leave-one-out method
(Manning and Schutze, 1999).
Experiments were conducted on two publicly
available benchmarks for which performance lev-
elsofmanysystems arepublished ontheWeb. Re-
sults show that better ME models for NER and TC
can be developed by integrating binary and real-
valued features.
1Darroch and Ratcliff (1972) show how any set of real-
valued feature functions can be properly handled.
1
2 ME Models for Text Tagging
Given a sequence of words wT1 = w1,...,wT and
a set of tags C, the goal of text-tagging is to find
a sequence of tags cT1 = c1,...,cT which maxi-
mizes the posterior probability, i.e.:
ˆcT
1 = argmaxcT
1
p(cT1 | wT1 ). (1)
By assuming a discriminative model, Eq. (1) can
be rewritten as follows:
ˆcT
1 = argmaxcT
1
Tproductdisplay
t=1
p(ct | ct−11 ,wT1 ), (2)
where p(ct|ct−11 ,wT1 ) is the target conditional
probability of tag ct given the context (ct−11 ,wT1 ),
i.e. the entire sequence of words and the full se-
quence of previous tags. Typically, independence
assumptions are introduced in order to reduce the
context size. While this introduces some approxi-
mations in the probability distribution, it consid-
erably reduces data sparseness in the sampling
space. For this reason, the context is limited here
to the two previous tags (ct−1t−2) and to four words
around the current word (wt+2t−2). Moreover, limit-
ing the context to the two previous tags permits to
apply dynamic programming (Bender et al., 2003)
to efficiently solve the maximization (2).
Let y = ct denote the class to be guessed (y ∈ Y)
at time t and x = ct−1t−2,wt+2t−2 its context (x ∈ X).
The generic ME model results:
pλ(y | x) = exp(
summationtextn
i=1 λifi(x,y))summationtext
yprime exp(
summationtextn
i=1 λifi(x,yprime))
. (3)
Thenfeature functions fi(x,y) represent anykind
of information about the event (x,y) which can be
useful for the classification task. Typically, binary
features are employed which model the verifica-
tion of simple events within the target class and
the context.
InMikheev (1998), binary features fortexttagging
are classified into two broad classes: atomic and
complex. Atomic features tell information about
the current tag and one single item (word ortag) of
the context. Complex features result as acombina-
tion of two or more atomic features. In this way, if
the grouped events are not independent, complex
features should capture higher correlations or de-
pendencies, possibly useful to discriminate.
In the following, a standard set of binary fea-
tures is presented, which is generally employed
fortext-tagging tasks. Thereader familiar withthe
topic can directly check this set in Table 1.
3 Standard Binary Features
Binary features are indicator functions ofspecified
events of the sample space X × Y. Hence, they
take value 1 if the event occurs or0 otherwise. For
the sake of notation, the feature name denotes the
type of event, while the index specifies its param-
eters. For example:
Orthperson,Cap,−1(x,y)
corresponds to an Orthographic feature which is
active if and only if the class at time t is person
and the word at time t−1 in the context starts with
capitalized letter.
3.1 Atomic Features
Lexical features These features model co-
occurrences ofclasses andsingle wordsofthecon-
text. Lexical features are defined on a window
of ±2 positions around the current word. Lexical
features are denoted by the nameLexand indexed
with the triple c,w,d which fixes the current class,
i.e. ct = c, the identity and offset of the word in
the context, i.e. wt+d = w. Formally, the feature
is computed by:
Lex c,w,d(x,y) ˆ= δ(ct = c) ·δ(wt+d = w).
For example, the lexical feature for word
Verona, at position t with tag loc (location) is:
Lexloc,Verona,0(x,y) = δ(ct = loc)·
·δ(wt = Verona).
Lexical features might introduce data sparseness
in the model, given that in real texts an impor-
tant fraction of words occur only once. In other
words, many words in the test set will have no
corresponding features-parameter pairs estimated
on the training data. To cope with this problem,
all words observed only once in the training data
were mapped into the special symbol oov.
Syntactic features They model co-occurrences
of the current class with part-of-speech or chunk
tagsofaspecificposition inthecontext. Syntactic
features are denoted by the nameSynand indexed
with a 4-tuple (c,Pos,p,d) or (c,Chnk,p,d),
2
Name Index Definition
Lex c,w,d δ(ct = c) ·δ(wt+d = w),d ∈ Z
Syn c,T,p,d δ(ct = c) ·δ(T(wt+d) = p) , T ∈ {Pos,Chnk},d ∈ Z
Orth c,F,d δ(ct = c) ·F(wt+d) , F ∈ {IsCap,IsCAP},d ∈ Z
Dict c,L,d δ(ct = c) ·InList(L,wt+d),d ∈ Z
Tran c,cprime,d δ(ct = c) ·δ(ct−d = cprime) d ∈ N+
Lex+ c,s,k,ws+k−1s producttexts+k−1d=s Lexc,wd,d(x,y),k ∈ N+,s ∈ Z
Syn+ c,T,s,k,ps+k−1s producttexts+k−1d=s Sync,T,pd,d(x,y),k ∈ N+,s ∈ Z
Orth+ c,F,k,b+k−k δ(ct = c) ·producttextkd=−k δ(Orthc,F,d(x,y) = bd) , bd ∈ {0,1},k ∈ N+
Dict+ c,L,k,b+k−k δ(ct = c) ·producttextkd=−k δ(Dictc,L,d(x,y) = bd) , bd ∈ {0,1},k ∈ N+
Tran+ c,k,ck1 producttextkd=1prime Tranc,cd,d(x,y) k ∈ N+
Table 1: Standard set of binary features for text tagging.
which fixes the class ct, the considered syntactic
information, and the tag and offset within the con-
text. Formally, these features are computed by:
Sync,Pos,p,d(x,y)ˆ=δ(ct = c)·δ(Pos(wt+d) = p)
Sync,Chnk,p,d(x,y)ˆ=δ(ct = c)·δ(Chnk(wt+d) = p).
Orthographic features These features model
co-occurrences of the current class with surface
characteristics of words of the context, e.g. check
if a specific word in the context starts with cap-
italized letter (IsCap) or is fully capitalized
(IsCAP). In this framework, only capitalization
information is considered. Analogously to syntac-
tic features, orthographic features are defined as
follows:
Orthc,IsCap,d(x,y)ˆ=δ(ct = c)· IsCap(wt+d)
Orthc,IsCAP,d(x,y)ˆ=δ(ct = c) ·IsCAP(wt+d).
Dictionary features These features check if
specific positions in the context contain words oc-
curring in some prepared list. This type of feature
results relevant for tasks such as NER, in which
gazetteers of proper names can be used to improve
coverage of the training data. Atomic dictionary
features are defined as follows:
Dictc,L,d(x,y)ˆ=δ(ct = c) · InList(L,wt+d)
where L is a specific pre-compiled list, and
InList is a function which returns 1 if the spec-
ified word matches one of the multi-word entries
of list L, and 0 otherwise.
Transition features Transition features model
Markov dependencies between the current tag and
a previous tag. They are defined as follows:
Tranc,cprime,d(x,y)ˆ=δ(ct = c)·δ(ct−d = cprime).
3.2 Complex Features
More complex events are defined by combining
two or more atomic features in one of two ways.
Product features take the intersection of the cor-
responding atomic events. V ector features con-
sider all possible outcomes of the component fea-
tures.
For instance, the product of 3 atomic Lexical
features, withclass c, offsets −2,−1,0, and words
v−2,v−1,v0, is:
Lex+c,−2,3,v−2,v−1,v0(x,y)ˆ=
0productdisplay
d=−2
Lexc,vd,d(x,y).
Vector features obtained from three Dictionary
features with the same class c, list L, and offsets,
respectively, -1,0,+1, are indexed over all possible
binary outcomes b−1,b0,b1 of the single atomic
features, i.e.:
Dict+c,L,1,b−1,b0,b+1(x,y)ˆ=δ(ct = c)×
1productdisplay
d=−1
δ(Dictc,L,d(x,y) = bd).
Complex features used in the experiments are de-
scribed in Table 1.
The use of complex features significantly in-
creases the model complexity. Assuming that
there are 10,000 words occurring more than once
inthe training corpus, the above lexical feature po-
tentially adds O(|C|1012) parameters!
As complex binary features might result pro-
hibitive from a computational point of view, real-
valued features should be considered as an alter-
native.
3
Feature Index Probability Distribution
Lex d p(ct | wt+d)
Syn T,d p(ct | T(wt+d))
Orth F,d p(ct | F(wt+d))
Dict List,d p(ct | IsIn(List,wt+d))
Tran d p(ct | ct−d)
Lex+ s,k p(ct | wt+s,..,wt+s+k−1
Syn+ T,s,k p(ct | T(wt+s,...,wt+s+k−1))
Orth+ k,F p(ct | F(wt−k),...,F(wt+k))
Dict+ k,L p(ct | InList(L,wt−k),...,InList(L,wt+k))
Tran+ k p(ct | ct−k,...,ct+k))
Table 2: Corresponding standard set of real-values features.
4 Real-valued Features
A binary feature can be seen as a probability mea-
sure with support set made of a single event. Ac-
cording to this point of view, we might easily ex-
tend binary features to probability measures de-
fined over larger event spaces. In fact, it results
convenient to introduce features which are log-
arithms of conditional probabilities. It can be
shown that in this way linear constraints of the
MEmodelcanbeinterpreted intermsofKullback-
Leibler distances between the target model and the
conditional distributions (Klakow, 1998).
Let p1(y|x),p2(y|x),...,pn(y|x) be n different
conditional probability distributions estimated on
the training corpus. In our framework, each con-
ditional probability pi is associated to a feature fi
which is defined over a subspace [X]i × Y of the
sample space X × Y. Hence, pi(y|x) should be
read as a shorthand of p(y | [x]i).
The corresponding real-valued feature is:
fi(x,y) = logpi(y | x). (4)
In this way, the ME in Eq. (3) can be rewritten as:
pλ(y|x) =
producttextn
i pi(y|x)
λi
summationtext
yprime
producttext
i pi(y
prime|x)λi . (5)
According to the formalism adopted in Eq. (4),
real-valued features assume the following form:
fi(ct,ct−1t−2,wt+2t−2) = logpi(ct | ct−1t−2,wt+2t−2). (6)
For each so far presented type of binary feature,
a corresponding real-valued type can be easily de-
fined. The complete list is shown in Table 2. In
general, the context subspace was defined on the
basis of the offset parameters of each binary fea-
ture. For instance, all lexical features selecting
two words at distances -1 and 0 from the current
position t are modeled by the conditional distri-
bution p(ct | wt−1,wt). While distributions of
lexical, syntactic and transition features are con-
ditioned on words or tags, dictionary and ortho-
graphic features are conditioned on binary vari-
ables.
An additional real-valued feature that was em-
ployed is the so called prior feature, i.e. the prob-
ability of a tag to occur:
Prior(x,y) = logp(ct)
A major effect of using real-valued features is
the drastic reduction of model parameters. For
example, each complex lexical features discussed
before introduce just one parameter. Hence, the
small number of parameters eliminates the need
of smoothing the ME estimates.
Real-valued features present some drawbacks.
Their level of granularity, or discrimination, might
result much lower than their binary variants. For
many features, it might result difficult to compute
reliable probability values due to data sparseness.
For the last issue, smoothing techniques devel-
opedforstatistical language modelscanbeapplied
(Manning and Schutze, 1999).
5 Mixed Feature Models
This work, beyond investigating the use of real-
valued features, addresses the behavior of models
combining binary and real-valued features. The
reason is twofold: on one hand, real-valued fea-
tures allow to capture complex information with
fewer parameters; on the other hand, binary fea-
tures permit to keep a good level of granularity
over salient characteristics. Hence, finding a com-
promise between binary and real-valued features
4
might help to develop ME models which better
trade-off complexity vs. granularity of informa-
tion.
6 Parameter Estimation
From the duality of ME and maximum likeli-
hood (Berger et al., 1996), optimal parameters
λ∗ for model (3) can be found by maximizing
the log-likelihood function over a training sample
{(xt,yt) : t = 1,...,N}, i.e.:
λ∗ = argmax
λ
Nsummationdisplay
t=1
logpλ(yt|xt). (7)
Now, whereas binary features take only twovalues
and do not need any estimation phase, conditional
probability features have to be estimated on some
data sample. The question arises about how to ef-
ficiently use the available training data in order to
estimate the parameters and the feature distribu-
tions of the model, by avoiding over-fitting.
Two alternative techniques, borrowed from sta-
tistical language modeling, have been consid-
ered: the Held-out and the Leave-one-out methods
(Manning and Schutze, 1999).
Held-outmethod. Thetraining sampleS issplit
into two parts used, respectively, to estimate the
feature distributions and the ME parameters.
Leave-one-out. ME parameters and feature dis-
tributions are estimated over the same sample S.
The idea is that for each addend in eq. (7), the cor-
responding sample point (xt,yt) is removed from
the training data used to estimate the feature distri-
butions of the model. In this way, it can be shown
that occurrences of novel observations are simu-
lated during the estimation of the ME parameters
(Federico and Bertoldi, 2004).
In ourexperiments, language modeling smooth-
ing techniques (Manning and Schutze, 1999) were
applied to estimate feature distributions pi(y|x).
In particular, smoothing was based on the dis-
counting method in Ney et al. (1994) combined to
interpolation with distributions using less context.
Given the small number of smoothing parameters
involved, leave-one-out probabilities wereapprox-
imated by just modifying count statistics on the
fly (Federico and Bertoldi, 2004). The rationale is
that smoothing parameters do not change signifi-
cantly after removing just one sample point.
For parameter estimation, the GIS algorithm
by Darroch and Ratcliff (1972) was applied. It
is known that the GIS algorithm requires feature
functions fi(x,y) to be non-negative. Hence, fea-
tures were re-scaled as follows:
fi(x,y) = logpi(y|x) + log 1 + epsilon1minp
i
, (8)
where epsilon1 is a small positive constant and the de-
nominator is a constant term defined by:
minpi = min
(x,y)∈S
pi(y|x). (9)
The factor (1 + epsilon1) was introduced to ensure that
real-valued features are always positive. This con-
dition is important to let features reflect the same
behavior of the conditional distributions, which
assign a positive probability to each event.
It is easy to verify that this scaling operation
does not affect the original model but only impacts
on the GIS calculations. Finally, a slack feature
wasintroduced by the algorithm to satisfy the con-
straint that all features sum up to a constant value
(Darroch and Ratcliff, 1972).
7 Experiments
Thissection presents results ofMEmodelsapplied
to two text-tagging tasks, Named Entity Recogni-
tion (NER) and Text Chunking (TC).
After a short introduction to the experimen-
tal framework, the detailed feature setting is pre-
sented. Then, experimental results are presented
for the following contrastive conditions: binary
versus real-valued features, training via held-out
versus leave-one-out, atomic versus complex fea-
tures.
7.1 Experimental Set-up
Named Entity Recognition English NER ex-
periments were carried out on the CoNLL-2003
shared task2. This benchmark is based on texts
from the Reuters Corpus which were manually
annotated with parts-of-speech, chunk tags, and
named entity categories. Four types of categories
are defined: person, organization, location and
miscellaneous, to include e.g. nations, artifacts,
etc. A filler class is used for the remaining words.
After including tags denoting the start of multi-
word entities, a total of 9 tags results. Data are
partitioned into training (200K words), develop-
ment (50K words), and test (46K words) samples.
2Data and results in http://cnts.uia.ac.be/conll2003/ner.
5
Text Chunking English TC experiments were
conducted on the CoNLL-2000 shared task3.
Texts originate from the Wall Street Journal and
areannotated withpart-of-speech tags and chunks.
Thechunk set consists of11 syntactic classes. The
set of tags which also includes start-markers con-
sists of23 classes. Dataissplit into training (210K
words) and test (47K words) samples.
Evaluation Tagging performance of both tasks
is expressed in terms of F-score, namely the har-
monic meanofprecision and recall. Differences in
performance have been statistically assessed with
respect to precision and recall, separately, by ap-
plying a standard test on proportions, with signif-
icance levels α = 0.05 and α = 0.1. Henceforth,
claimed differences in precision or recall will have
their corresponding significance level shown in
parenthesis.
7.2 Settings and Baseline Models
Feature selection and setting for ME models is an
art. In these experiments we tried to use the same
set of features with minor modifications across
both tasks. In particular, used features and their
settings are shown in Table 3.
Training of models with GIS and estimation
of feature distributions used in-house developed
toolkits. Performance of binary feature models
was improved by smoothing features with Gaus-
sian priors (Chen and Rosenfeld, 1999) with mean
zero and standard deviation σ = 4. In general,
tuning ofmodelswascarried outonadevelopment
set.
Most of the comparative experiments were per-
formed on the NER task. Three baseline models
using atomic features Lex, Syn, and Tran were
investigated first: model BaseBin, with all binary
features; modelBaseReal, withallreal-valued fea-
tures plus the prior feature; model BaseMix, with
real-valued Lexand binary Tranand Syn. Mod-
els BaseReal and BaseMix were trained with the
held-out method. In particular, feature distribu-
tions wereestimated onthetraining data whileME
parameters on the development set.
7.3 Binary vs. Real-valued Features
The first experiment compares performance of the
baseline models on the NER task. Experimental
results are summarized in Table 4. Models Base-
Bin, BaseReal, and BaseMix achieved F-scores of
3Data and results in http://cnts.uia.ac.be/conll2000/chunking.
Model ID Num P% R% F-score
BaseBin 580K 78.82 75.62 77.22
BaseReal 10 79.74 74.15 76.84
BaseMix 753 78.90 75.85 77.34
Table 4: Performance of baseline models on the
NER task. Number of parameters, precision, re-
call, and F-score are reported for each model.
Model Methods P% R% F-score
BaseMix Held-Out 78.90 75.85 77.34
BaseMix L-O-O 80.64 76.40 78.46
Table 5: Performance of mixed feature models
with two different training methods.
77.22, 76.84, and 77.34. Statistically meaning-
ful differences were in terms of recall, between
BaseBin and BaseReal (α = 0.1), and between
BaseMix and BaseReal (α = 0.05).
Despite models BaseMix and BaseBin perform
comparably, the former has many fewer parame-
ters, i.e. 753 against 580,000. In fact, BaseMix re-
quires storing and estimating feature distributions,
which is however performed at a marginal compu-
tational cost and off-line with respect to GIS train-
ing.
7.4 Training with Mixed Features
An experiment was conducted with the BaseMix
model to compare the held-out and leave-one-out
training methods. Results in terms of F-score are
reported in Table 5. By applying the leave-one-
out method F-score grows from 77.34 to 78.46,
with a meaningful improvement in recall (α =
0.05). With respect to models BaseBin and Base-
Real, leave-one-out estimation significantly im-
proved precision (α = 0.05).
In terms of training time, ME models with real-
valued features took significantly more GIS iter-
ations to converge. Figures of cost per iteration
and number of iterations are reported in Table 6.
(Computation timesare measured ona single CPU
Pentium-4 2.8GHz.) Memory size of the training
process is instead proportional to the number n of
parameters.
7.5 Complex Features
A final set of experiments aims at comparing the
baseline MEmodels augmented withcomplex fea-
tures, again either binary only (model FinBin),
6
Feature Index NE Task Chunking Task
Lex c,w,d N(w) > 1,−2 ≤ d ≤ +2 −2 ≤ d ≤ +2
Syn c,T,p,d T ∈ {Pos,Chnk},d = 0 T = Pos,−2 ≤ d ≤ +2
Tran c,cprime,d d = −2,−1 d = −2,−1
Lex+ c,s,k,ws+k−1s s = −1,0,k = 1 s = −1,0 k = 1
Syn+ c,T,s,k,ps+k−1s not used s = −1,0 k = 1
Orth+ c,k,F,b+k−k F = {Cap,CAP}, k = 2 F = Cap, k = 1
Dict+ c,k,L,b+k−k k = 3L = {LOC,PER,ORG,MISC} not used
Tran+ c,k,ck1 k = 2 k = 2
Table 3: Setting used for binary and real-valued features in the reported experiments.
Model Single Iteration Iterations Total
BaseBin 54 sec 750 ≈ 11 h
BaseReal 9.6 sec 35,000 ≈ 93 h
BaseMix 42 sec 4,000 ≈ 46 h
Table 6: Computational cost of parameter estima-
tion by different baseline models.
real-valued only (FinReal), or mixed (FinMix).
Results are provided both for NER and TC.
This time, compared models use different fea-
ture settings. In fact, while previous experiments
aimed at comparing the same features, in either
real or binary form, these experiments explore al-
ternatives to a full-fledged binary model. In par-
ticular, real-valued features are employed whose
binary versions would introduce a prohibitively
large number of parameters. Parameter estima-
tion of models including real-valued features al-
ways applies the leave-one-out method.
For the NER task, model FinBin adds Orth+
and Dict+; FinReal adds Lex+, Orth+ and
Dict+; and, FinMix adds real-valued Lex+ and
binary-valued Orth+ and Dict+.
In the TC task, feature configurations are as fol-
lows: FinBin uses Lex, Syn, Tran, and Orth+;
FinReal uses Lex, Syn, Tran, Prior, Orth+,
Lex+, Syn+, Tran+; and, finally, FinMix uses
binary Syn, Tran, Orth+ and real-valued Lex,
Lex+, Syn+.
Performance of the models on the two tasks are
reported in Table 7 and Table 8, respectively.
In the NERtask, all final models outperform the
baseline model. Improvements in precision and
recall are all significant (α = 0.05). Model Fin-
Miximproves precision withrespect to modelFin-
Bin (α = 0.05) and requires two order of magni-
tude fewer parameters.
Model Num P% R% F-score
FinBin 673K 81.92 80.36 81.13
FinReal 19 83.58 74.03 78.07
FinMix 3K 84.34 80.38 82.31
Table 7: Results with complex features on the
NER task.
Model Num P% R% F-score
FinBin 2M 91.04 91.48 91.26
FinReal 19 88.73 90.58 89.65
FinMix 6K 91.93 92.24 92.08
Table 8: Results with complex features on the TC
task.
In the TC task, the same trend is observed.
Again, best performance is achieved by the model
combining binary and real-valued features. In par-
ticular, all observable differences in terms of pre-
cision and recall are significant (α = 0.05).
8 Discussion
Insummary, thispaperaddressed improvements to
ME models for text tagging applications. In par-
ticular, we showed how standard binary features
from theliterature can bemapped intocorrespond-
inglog-probability distributions. MEtraining with
theso-obtained real-valued features can beaccom-
plished by combining the GIS algorithm with the
leave-one-out or held-out methods.
With respect to the best performing systems at
the CoNLL shared tasks, our models exploit a rel-
atively smaller set of features and perform signifi-
cantly worse. Nevertheless, performance achieved
by our system are comparable with those reported
by otherME-based systems taking part inthe eval-
uations.
Extensive experiments on named-entity recog-
7
nition and text chunking have provided support to
the following claims:
• The introduction of real-valued features dras-
tically reduces the number of parameters of
the ME model with a small loss in perfor-
mance.
• The leave-one-out method is significantly
more effective than the held-out method for
training ME models including real-valued
features.
• The combination of binary and real-valued
features can leadtobetterMEmodels. Inpar-
ticular, state-of-the-art ME models with bi-
nary features are significantly improved by
adding complex real-valued features which
model long-span lexical dependencies.
Finally, the GIS training algorithm does not
seem to be the optimal choice for ME models in-
cluding real-valued features. Future work will in-
vestigate variants of and alternatives to the GIS
algorithm. Preliminary experiments on the Base-
Real model showed that training with the Simplex
algorithm (Press et al., 1988) converges to simi-
lar parameter settings 50 times faster than the GIS
algorithm.
9 Acknowledgments
This work was partially financed by the Euro-
pean Commission under the project FAME (IST-
2000-29323), and by the Autonomous Province of
Trento under the the FU-PAT project WebFaq.

References
O. Bender, F. J. Och, and H. Ney. 2003. Maximum
entropy models for named entity recognition. In
Walter Daelemans and Miles Osborne, editors, Pro-
ceedings of CoNLL-2003, pages 148–151. Edmon-
ton, Canada.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra.
1996. A Maximum Entropy Approach to Natural
Language Processing. Computational Linguistics,
22(1):39–72.
A. Borthwick. 1999. A Maximum Entropy approach
to Named Entity Recognition. Ph.D. thesis, Com-
puter Science Department - New York University,
New York, USA.
S. Chen and R. Rosenfeld. 1999. A Gaussian prior
for smoothing maximum entropy models. Techni-
cal Report CMUCS-99-108, Carnegie Mellon Uni-
versity.
J.N. Darroch and D. Ratcliff. 1972. Generalized Itera-
tive Scaling for Log-Liner models. Annals of Math-
ematical Statistics, 43:1470–1480.
M. Federico and N. Bertoldi. 2004. Broadcast news
lmadaptationovertime. ComputerSpeechand Lan-
guage, 18(4):417–435,October.
D.Klakow. 1998. Log-linearinterpolationoflanguage
models. In Proceedings of the InternationalConfer-
ence of Spoken Language P rocessing (ICSLP), Sid-
ney, Australia.
R. Koeling. 2000. Chunking with maximum entropy
models. In Proceedings of CoNLL-2000, pages
139–141,Lisbon, Portugal.
C. D. Manning and H. Schutze. 1999. Foundations
of Statistical Natural Language Processing. MIT
Press.
A. Mikheev. 1998. Feature lattices for maximum en-
tropy modelling. In COLING-ACL, pages 848–854.
H. Ney, U. Essen, and R. Kneser. 1994. On structur-
ingprobabilisticdependencesin stochastic language
modeling. Computer Speech and Language,8(1):1–
38.
F.J.OchandH.Ney. 2002. Discriminativetrainingand
maximum entropy models for statistical machin e
translation. In ACL02: Proceedings of the 40th An-
nual Meeting of the Association for Computational
Linguistics, pages 295–302,PA, Philadelphia.
W. H. Press, B. P. Flannery, S. A. Teukolsky,and W. T.
Vetterling. 1988. Numerical Recipes in C. Cam-
bridge University Press, New York, NY.
