Confidence Estimation for Information Extraction
Aron Culotta
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
culotta@cs.umass.edu
Andrew McCallum
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
mccallum@cs.umass.edu
Abstract
Information extraction techniques automati-
cally create structured databases from un-
structured data sources, such as the Web or
newswire documents. Despite the successes of
these systems, accuracy will always be imper-
fect. For many reasons, it is highly desirable to
accurately estimate the confidence the system
has in the correctness of each extracted field.
The information extraction system we evalu-
ate is based on a linear-chain conditional ran-
dom field (CRF), a probabilistic model which
has performed well on information extraction
tasks because of its ability to capture arbitrary,
overlapping features of the input in a Markov
model. We implement several techniques to es-
timate the confidence of both extracted fields
and entire multi-field records, obtaining an av-
erage precision of 98% for retrieving correct
fields and 87% for multi-field records.
1 Introduction
Information extraction usually consists of tagging a se-
quence of words (e.g. a Web document) with semantic
labels (e.g. PERSONNAME, PHONENUMBER) and de-
positing these extracted fields into a database. Because
automated information extraction will never be perfectly
accurate, it is helpful to have an effective measure of
the confidence that the proposed database entries are cor-
rect. There are at least three important applications of
accurate confidence estimation. First, accuracy-coverage
trade-offs are a common way to improve data integrity in
databases. Efficiently making these trade-offs requires an
accurate prediction of correctness.
Second, confidence estimates are essential for inter-
active information extraction, in which users may cor-
rect incorrectly extracted fields. These corrections are
then automatically propagated in order to correct other
mistakes in the same record. Directing the user to
the least confident field allows the system to improve
its performance with a minimal amount of user effort.
Kristjannson et al. (2004) show that using accurate con-
fidence estimation reduces error rates by 46%.
Third, confidence estimates can improve performance
of data mining algorithms that depend upon databases
created by information extraction systems (McCallum
and Jensen, 2003). Confidence estimates provide data
mining applications with a richer set of “bottom-up” hy-
potheses, resulting in more accurate inferences. An ex-
ample of this occurs in the task of citation co-reference
resolution. An information extraction system labels each
field of a paper citation (e.g. AUTHOR, TITLE), and then
co-reference resolution merges disparate references to the
same paper. Attaching a confidence value to each field
allows the system to examine alternate labelings for less
confident fields to improve performance.
Sound probabilistic extraction models are most con-
ducive to accurate confidence estimation because of their
intelligent handling of uncertainty information. In this
work we use conditional random fields (Lafferty et al.,
2001), a type of undirected graphical model, to automat-
ically label fields of contact records. Here, a record is an
entire block of a person’s contact information, and a field
is one element of that record (e.g. COMPANYNAME). We
implement several techniques to estimate both field con-
fidence and record confidence, obtaining an average pre-
cision of 98% for fields and 87% for records.
2 Conditional Random Fields
Conditional random fields (Lafferty et al., 2001) are undi-
rected graphical models to calculate the conditional prob-
ability of values on designated output nodes given val-
ues on designated input nodes. In the special case in
which the designated output nodes are linked by edges in
a linear chain, CRFs make a first-order Markov indepen-
dence assumption among output nodes, and thus corre-
spond to finite state machines (FSMs). In this case CRFs
can be roughly understood as conditionally-trained hid-
den Markov models, with additional flexibility to effec-
tively take advantage of complex overlapping features.
Let o = 〈o1,o2,...oT〉 be some observed input data se-
quence, such as a sequence of words in a document (the
values on T input nodes of the graphical model). Let S be
a set of FSM states, each of which is associated with a la-
bel (such as COMPANYNAME). Let s = 〈s1,s2,...sT〉 be
some sequence of states (the values on T output nodes).
CRFs define the conditional probability of a state se-
quence given an input sequence as
pΛ(s|o) = 1Z
o
exp
parenleftBigg Tsummationdisplay
t=1
summationdisplay
k
λkfk(st−1,st,o,t)
parenrightBigg
,
(1)
where Zo is a normalization factor over all state se-
quences, fk(st−1,st,o,t) is an arbitrary feature func-
tion over its arguments, and λk is a learned weight for
each feature function. Zo is efficiently calculated using
dynamic programming. Inference (very much like the
Viterbi algorithm in this case) is also a matter of dynamic
programming. Maximum aposteriori training of these
models is efficiently performed by hill-climbing methods
such as conjugate gradient, or its improved second-order
cousin, limited-memory BFGS.
3 Field Confidence Estimation
The Viterbi algorithm finds the most likely state sequence
matching the observed word sequence. The word that
Viterbi matches with a particular FSM state is extracted
as belonging to the corresponding database field. We can
obtain a numeric score for an entire sequence, and then
turn this into a probability for the entire sequence by nor-
malizing. However, to estimate the confidence of an indi-
vidual field, we desire the probability of a subsequence,
marginalizing out the state selection for all other parts
of the sequence. A specialization of Forward-Backward,
termed Constrained Forward-Backward (CFB), returns
exactly this probability.
Because CRFs are conditional models, Viterbi finds
the most likely state sequence given an observation se-
quence, defined as s∗ = argmaxs pΛ(s|o). To avoid an
exponential-time search over all possible settings of s,
Viterbi stores the probability of the most likely path at
time t that accounts for the first t observations and ends
in state si. Following traditional notation, we define this
probability to be δt(si), where δ0(si) is the probability of
starting in each state si, and the recursive formula is:
δt+1(si) = max
sprime
bracketleftBig
δt(sprime)exp
parenleftBigsummationdisplay
k
λkfk(sprime,si,o,t)
parenrightBigbracketrightBig
(2)
terminating in s∗ = argmax
s1≤si≤sN
[δT(si)].
The Forward-Backward algorithm can be viewed as a
generalization of the Viterbi algorithm: instead of choos-
ing the optimal state sequence, Forward-Backward eval-
uates all possible state sequences given the observation
sequence. The “forward values” αt+1(si) are recursively
defined similarly as in Eq. 2, except the max is replaced
by a summation. Thus we have
αt+1(si) =
summationdisplay
sprime
bracketleftBig
αt(sprime)exp
parenleftBigsummationdisplay
k
λkfk(sprime,si,o,t)
parenrightBigbracketrightBig
.
(3)
terminating in Zo =summationtexti αT(si) from Eq. 1.
To estimate the probability that a field is extracted
correctly, we constrain the Forward-Backward algorithm
such that each path conforms to some subpath of con-
straints C = 〈sq ...sr〉 from time step q to r. Here,
sq ∈ C can be either a positive constraint (the sequence
must pass through sq) or a negative constraint (the se-
quence must not pass through sq).
In the context of information extraction, C corresponds
to an extracted field. The positive constraints specify the
observation tokens labeled inside the field, and the neg-
ative constraints specify the field boundary. For exam-
ple, if we use states names B-TITLE and I-JOBTITLE to
label tokens that begin and continue a JOBTITLE field,
and the system labels observation sequence 〈o2,...,o5〉
as a JOBTITLE field, then C = 〈s2 = B-JOBTITLE,
s3 = ... = s5 = I-JOBTITLE,s6 negationslash= I-JOBTITLE〉.
The calculations of the forward values can be made to
conform to C by the recursion αprimeq(si) =
braceleftBiggP
sprime
h
αprimeq−1(sprime)exp
“P
k λkfk(s
prime,si,o,t)
”i
if si similarequal sq
0 otherwise
for all sq ∈ C, where the operator si similarequal sq means si
conforms to constraint sq. For time steps not constrained
by C, Eq. 3 is used instead.
If αprimet+1(si) is the constrained forward value, then
Zprimeo = summationtexti αprimeT(si) is the value of the constrained lat-
tice, the set of all paths that conform to C. Our confi-
dence estimate is obtained by normalizing Zprimeo using Zo,
i.e. Zprimeo − Zo.
We also implement an alternative method that uses the
state probability distributions for each state in the ex-
tracted field. Let γt(si) = p(si|o1,...,oT) be the prob-
ability of being in state i at time t given the observation
sequence . We define the confidence measure GAMMA
to be producttextvi=u γi(si), where u and v are the start and end
indices of the extracted field.
4 Record Confidence Estimation
We can similarly use CFB to estimate the probability that
an entire record is labeled correctly. The procedure is
the same as in the previous section, except that C now
specifies the labels for all fields in the record.
We also implement three alternative record confidence
estimates. FIELDPRODUCT calculates the confidence of
each field in the record using CFB, then multiplies these
values together to obtain the record confidence. FIELD-
MIN instead uses the minimum field confidence as the
record confidence. VITERBIRATIO uses the ratio of the
probabilities of the top two Viterbi paths, capturing how
much more likely s∗ is than its closest alternative.
5 Reranking with Maximum Entropy
We also trained two conditional maximum entropy clas-
sifiers to classify fields and records as being labeled cor-
rectly or incorrectly. The resulting posterior probabil-
ity of the “correct” label is used as the confidence mea-
sure. The approach is inspired by results from (Collins,
2000), which show discriminative classifiers can improve
the ranking of parses produced by a generative parser.
After initial experimentation, the most informative in-
puts for the field confidence classifier were field length,
the predicted label of the field, whether or not this field
has been extracted elsewhere in this record, and the CFB
confidence estimate for this field. For the record confi-
dence classifier, we incorporated the following features:
record length, whether or not two fields were tagged with
the same label, and the CFB confidence estimate.
6 Experiments
2187 contact records (27,560 words) were collected from
Web pages and email and 25 classes of data fields were
hand-labeled.1 The features for the CRF consist of the
token text, capitalization features, 24 regular expressions
over the token text (e.g. CONTAINSHYPHEN), and off-
sets of these features within a window of size 5. We also
use 19 lexicons, including “US Last Names,” “US First
Names,” and “State Names.” Feature induction is not
used in these experiments. The CRF is trained on 60% of
the data, and the remaining 40% is split evenly into de-
velopment and testing sets. The development set is used
to train the maximum entropy classifiers, and the testing
set is used to measure the accuracy of the confidence es-
timates. The CRF achieves an overall token accuracy of
87.32 on the testing data, with a field-level performance
of F1 = 84.11, precision = 85.43, and recall = 82.83.
To evaluate confidence estimation, we use three meth-
ods. The first is Pearson’s r, a correlation coefficient
ranging from -1 to 1 that measures the correlation be-
tween a confidence score and whether or not the field
(or record) is correctly labeled. The second is average
precision, used in the Information Retrieval community
1The 25 fields are: FirstName, MiddleName, LastName,
NickName, Suffix, Title, JobTitle, CompanyName, Depart-
ment, AddressLine, City1, City2, State, Country, PostalCode,
HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mo-
bile, Pager, VoiceMail, URL, Email, InstantMessage
Pearson’s r Avg. Prec
CFB .573 .976
MaxEnt .571 .976
Gamma .418 .912
Random .012 .858
WorstCase – .672
Table 1: Evaluation of confidence estimates for field confi-
dence. CFB and MAXENT outperform competing methods.
Pearson’s r Avg. Prec
CFB .626 .863
MaxEnt .630 .867
FieldProduct .608 .858
FieldMin .588 .843
ViterbiRatio .313 .842
Random .043 .526
WorstCase – .304
Table 2: Evaluation of confidence estimates for record confi-
dence. CFB, MAXENT again perform best.
to evaluate ranked lists. It calculates the precision at
each point in the ranked list where a relevant document
is found and then averages these values. Instead of rank-
ing documents by their relevance score, here we rank
fields (and records) by their confidence score, where a
correctly labeled field is analogous to a relevant docu-
ment. WORSTCASE is the average precision obtained
by ranking all incorrect instances above all correct in-
stances. Tables 1 and 2 show that CFB and MAXENT are
statistically similar, and that both outperform competing
methods. Note that WORSTCASE achieves a high aver-
age precision simply because so many fields are correctly
labeled. In all experiments, RANDOM assigns confidence
values chosen uniformly at random between 0 and 1.
The third measure is an accuracy-coverage graph. Bet-
ter confidence estimates push the curve to the upper-right.
Figure 1 shows that CFB and MAXENT dramatically out-
perform GAMMA. Although omitted for space, similar
results are also achieved on a noun-phrase chunking task
(CFB r = .516, GAMMA r = .432) and a named-entity
extraction task (CFB r = .508, GAMMA r = .480).
7 Related Work
While there has been previous work using probabilistic
estimates for token confidence, and heuristic estimates
for field confidence, to the best of our knowledge this pa-
per is the first to use a sound, probabilistic estimate for
confidence of multi-word fields and records in informa-
tion extraction.
Much of the work in confidence estimation
for IE has been in the active learning literature.
Scheffer et al. (2001) derive confidence estimates using
hidden Markov models in an information extraction
system. However, they do not estimate the confidence
of entire fields, only singleton tokens. They estimate
 0.84
 0.86
 0.88
 0.9
 0.92
 0.94
 0.96
 0.98
 1
 0  0.2  0.4  0.6  0.8  1
accuracy
coverage
"Optimal"
"CFB"
"MaxEnt"
"Gamma"
"Random"
Figure 1: The precision-recall curve for fields shows that CFB
and MAXENT outperform GAMMA.
the confidence of a token by the difference between
the probabilities of its first and second most likely
labels, whereas CFB considers the full distribution of
all suboptimal paths. Scheffer et al. (2001) also explore
an idea similar to CFB to perform Baum-Welch training
with partially labeled data, where the provided labels
are constraints. However, these constraints are again for
singleton tokens only.
Rule-based extraction methods (Thompson et al.,
1999) estimate confidence based on a rule’s coverage in
the training data. Other areas where confidence estima-
tion is used include document classification (Bennett et
al., 2002), where classifiers are built using meta-features
of the document; speech recognition (Gunawardana et al.,
1998), where the confidence of a recognized word is esti-
mated by considering a list of commonly confused words;
and machine translation (Gandrabur and Foster, 2003),
where neural networks are used to learn the probability of
a correct word translation using text features and knowl-
edge of alternate translations.
8 Conclusion
We have shown that CFB is a mathematically and empir-
ically sound confidence estimator for finite state informa-
tion extraction systems, providing strong correlation with
correctness and obtaining an average precision of 97.6%
for estimating field correctness. Unlike methods margin
maximization methods such as SVMs and M3Ns (Taskar
et al., 2003), CRFs are trained to maximize conditional
probability and are thus more naturally appropriate for
confidence estimation. Interestingly, reranking by MAX-
ENT does not seem to improve performance, despite the
benefit Collins (2000) has shown discriminative rerank-
ing to provide generative parsers. We hypothesize this is
because CRFs are already discriminative (not joint, gen-
erative) models; furthermore, this may suggest that future
discriminative parsing methods will also have the benefits
of discriminative reranking built-in directly.
Acknowledgments
We thank the reviewers for helpful suggestions and refer-
ences. This work was supported in part by the Center for
Intelligent Information Retrieval, by the Advanced Research
and Development Activity under contract number MDA904-
01-C-0984, by The Central Intelligence Agency, the Na-
tional Security Agency and National Science Foundation un-
der NSF grant #IIS-0326249, and by the Defense Advanced
Research Projects Agency, through the Department of the Inte-
rior, NBC, Acquisition Services Division, under contract num-
ber NBCHD030010.
References
Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. 2002.
Probabilistic combination of text classifiers using reliability
indicators: models and results. In Proceedings of the 25th
annual international ACM SIGIR conference on Research
and development in information retrieval, pages 207–214.
ACM Press.
Michael Collins. 2000. Discriminative reranking for natu-
ral language parsing. In Proc. 17th International Conf. on
Machine Learning, pages 175–182. Morgan Kaufmann, San
Francisco, CA.
Simona Gandrabur and George Foster. 2003. Confidence esti-
mation for text prediction. In Proceedings of the Conference
on Natural Language Learning (CoNLL 2003), Edmonton,
Canada.
A. Gunawardana, H. Hon, and L. Jiang. 1998. Word-based
acoustic confidence measures for large-vocabulary speech
recognition. In Proc. ICSLP-98, pages 791–794, Sydney,
Australia.
Trausti Kristjannson, Aron Culotta, Paul Viola, and Andrew
McCallum. 2004. Interactive information extraction with
conditional random fields. To appear in Nineteenth National
Conference on Artificial Intelligence (AAAI 2004).
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001.
Conditional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proc. 18th Interna-
tional Conf. on Machine Learning, pages 282–289. Morgan
Kaufmann, San Francisco, CA.
Andrew McCallum and David Jensen. 2003. A note on the
unification of information extraction and data mining using
conditional-probability, relational models. In IJCAI03 Work-
shop on Learning Statistical Models from Relational Data.
Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001.
Active hidden markov models for information extraction.
In Advances in Intelligent Data Analysis, 4th International
Conference, IDA 2001.
Ben Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-
margin markov networks. In Proceedings of Neural Infor-
mation Processing Systems Conference.
Cynthia A. Thompson, Mary Elaine Califf, and Raymond J.
Mooney. 1999. Active learning for natural language pars-
ing and information extraction. In Proc. 16th International
Conf. on Machine Learning, pages 406–414. Morgan Kauf-
mann, San Francisco, CA.
