Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 296–303,
New York, June 2006. c©2006 Association for Computational Linguistics
Integrating Probabilistic Extraction Models and Data Mining
to Discover Relations and Patterns in Text
Aron Culotta
University of Massachusetts
Amherst, MA 01003
culotta@cs.umass.edu
Andrew McCallum
University of Massachusetts
Amherst, MA 01003
mccallum@cs.umass.edu
Jonathan Betz
Google, Inc.
New York, NY 10018
jtb@google.com
Abstract
In order for relation extraction systems
to obtain human-level performance, they
must be able to incorporate relational pat-
terns inherent in the data (for example,
that one’s sister is likely one’s mother’s
daughter, or that children are likely to
attend the same college as their par-
ents). Hand-coding such knowledge can
be time-consuming and inadequate. Addi-
tionally, there may exist many interesting,
unknown relational patterns that both im-
prove extraction performance and provide
insight into text. We describe a probabilis-
tic extraction model that provides mutual
benefits to both “top-down” relational pat-
tern discovery and “bottom-up” relation
extraction.
1 Introduction
Consider these four sentences:
1. George W. Bush’s father is George H. W. Bush.
2. GeorgeH.W.Bush’ssisterisNancyBushEllis.
3. Nancy Bush Ellis’s son is John Prescott Ellis.
4. John Prescott Ellis analyzed George W. Bush’s
campaign.
We would like to build an automated system to
extract the set of relations shown in Figure 1.
cousin
Nancy Ellis BushsiblingGeorge HW Bush
George W Bush
son
John Prescott Ellis
son
Figure 1: Bush family tree
State of the art extraction algorithms may be able
to detect the son and sibling relations from local lan-
guage clues. However, the cousin relation is only
implied by the text and requires additional knowl-
edge to be extracted. Specifically, the system re-
quires knowledge of familial relation patterns.
One could imagine a system that accepts such
rules as input (e.g. cousin = father’s sister’s son)
and applies them to extract implicit relations. How-
ever, exhaustively enumerating all possible rules can
be tedious and incomplete. More importantly, many
relational patterns unknown a priori may both im-
prove extraction accuracy and uncover informative
trends in the data (e.g. that children often adopt the
religion of their parents). Indeed, the goal of data
mining is to learn such patterns from database reg-
ularities. Since these patterns will not always hold,
we would like to handle them probabilistically.
We propose an integrated supervised machine
learning method that learns both contextual and re-
lational patterns to extract relations. In particular,
we construct a linear-chain conditional random field
(Lafferty et al., 2001; Sutton and McCallum, 2006)
to extract relations from biographical texts while si-
multaneously discovering interesting relational pat-
terns that improve extraction performance.
296
2 Related Work
This work can be viewed as a step toward the in-
tegration of information extraction and data mining
technology, a direction of growing interest. Nahm
and Mooney (2000) present a system that mines as-
sociation rules from a database constructed from au-
tomaticallyextracteddata,thenappliestheselearned
rules to improve data field recall without revisiting
the text. Our work attempts to more tightly inte-
grate the extraction and mining tasks by learning
relational patterns that can be included probabilis-
tically into extraction to improve its accuracy; also,
our work focuses on mining from relational graphs,
rather than single-table databases.
McCallum and Jensen (2003) argue the theoreti-
cal benefits of an integrated probabilistic model for
extraction and mining, but do not construct such a
system. Our work is a step in the direction of their
proposal, using an inference procedure based on a
closed-loop iteration between extraction and rela-
tional pattern discovery.
Mostotherworkinthisareaminesrawtext, rather
than a database automatically populated via extrac-
tion (Hearst, 1999; Craven et al., 1998).
This work can also be viewed as part of a trend
to perform joint inference across multiple language
processing tasks (Miller et al., 2000; Roth and tau
Yih, 2002; Sutton and McCallum, 2004).
Finally, using relational paths between entities is
also examined in (Richards and Mooney, 1992) to
escapelocalmaximainafirst-orderlearningsystem.
3 Relation Extraction as Sequence
Labeling
Relation extraction is the task of discovering seman-
tic connections between entities. In text, this usu-
ally amounts to examining pairs of entities in a doc-
ument and determining (from local language cues)
whether a relation exists between them. Common
approaches to this problem include pattern match-
ing (Brin, 1998; Agichtein and Gravano, 2000),
kernel methods (Zelenko et al., 2003; Culotta and
Sorensen, 2004; Bunescu and Mooney, 2006), lo-
gistic regression (Kambhatla, 2004), and augmented
parsing (Miller et al., 2000).
The pairwise classification approach of kernel
methods and logistic regression is commonly a two-
phase method: first the entities in a document are
identified, then a relation type is predicted for each
pair of entities. This approach presents at least
two difficulties: (1) enumerating all pairs of enti-
ties, even when restricted to pairs within a sentence,
results in a low density of positive relation exam-
ples; and (2) errors in the entity recognition phase
can propagate to errors in the relation classification
stage. As an example of the latter difficulty, if a per-
son is mislabeled as a company, then the relation
classifier will be unsuccessful in finding a brother
relation, despite local evidence.
We avoid these difficulties by restricting our in-
vestigation to biographical texts, e.g. encyclopedia
articles. A biographical text mostly discusses one
entity, which we refer to as the principal entity. We
refer to other mentioned entities as secondary enti-
ties. For each secondary entity, our goal is to predict
what relation, if any, it has to the principal entity.
This formulation allows us to treat relation ex-
traction as a sequence labeling task such as named-
entity recognition or part-of-speech tagging, and we
can now apply models that have been successful on
those tasks. By anchoring one argument of relations
to be the principal entity, we alleviate the difficulty
of enumerating all pairs of entities in a document.
By converting to a sequence labeling task, we fold
the entity recognition step into the relation extrac-
tion task. There is no initial pass to label each entity
as a person or company. Instead, an entity’s label is
its relation to the principal entity. Below is an exam-
ple of a labeled article:
George W. Bush
George is the son of George H. W. Bushbracehtipupleft bracehtipdownrightbracehtipdownleft bracehtipupright
fatherand Barbara Bush
bracehtipupleft bracehtipdownrightbracehtipdownleft bracehtipupright
mother
.
Additionally, by using a sequence model we can
capturethedependencebetweenadjacentlabels. For
example, in our data it is common to see phrases
such as “son of the Republican president George H.
W. Bush” for which the labels politicalParty, jobTi-
tle, and father occur consecutively. Sequence mod-
els are specifically designed to handle these kinds
of dependencies. We now discuss the details of our
extraction model.
297
3.1 Conditional Random Fields
We build a model to extract relations using linear-
chain conditional random fields (CRFs) (Lafferty
et al., 2001; Sutton and McCallum, 2006). CRFs
are undirected graphical models (i.e. Markov net-
works)thatarediscriminatively-trainedtomaximize
the conditional probability of a set of output vari-
ables y given a set of input variables x. This condi-
tional distribution has the form
pΛ(y|x) = 1Z
x
productdisplay
c∈C
φc(yc,xc;Λ) (1)
where φ are potential functions parameterized by Λ
and Zx = summationtextyproducttextc∈C φ(yc,xc) is a normalization
factor. Assuming φc factorizes as a log-linear com-
bination of arbitrary features computed over clique
c, then φc(yc,xc;Λ) = exp(summationtextk λkfk(yc,xc)),
where f is a set of arbitrary feature functions over
the input, each of which has an associate model
parameter λk. Parameters Λ = {λk} are a set
of real-valued weights typically estimated from la-
beled training data by maximizing the data likeli-
hood function using gradient ascent.
In these experiments, we make a first-order
Markov assumption on the dependencies among y,
resulting in a linear-chain CRF.
4 Relational Patterns
The modeling flexibility of CRFs permits the fea-
turefunctionstobecomplex,overlappingfeaturesof
the input without requiring additional assumptions
on their inter-dependencies. In addition to common
language features (e.g. neighboring words and syn-
tactic information), in this work we explore features
that cull relational patterns from a database of enti-
ties.
As described in the introductory example (Figure
1), context alone is often insufficient to extract re-
lations. Even in simpler examples, it may be the
case that modeling relational patterns can improve
extraction accuracy.
To capture this evidence, we compute features
from a database to indicate relational connections
between entities, similar to the relational path-
finding performed in Richards and Mooney (1992).
Imagine that the four sentence example about the
Bush family is included in a training set, and the en-
cousin
father son
X Y
sibling
Figure 2: A feature template for the cousin relation.
tities are labeled with their correct relations. In this
case, the cousin relation in sentence 4 would also be
labeled. From this data, we can create a relational
database that contains the relations in Figure 1.
Assumesentence4comesfromabiographyabout
John Ellis. We calculate a feature for the entity
George W. Bush that indicates the path from John
Ellis to George W. Bush in the database, annotat-
ing each edge in the path with its relation label; i.e.
father-sibling-son. By abstracting away the actual
entity names, we have created a cousin template fea-
ture, as shown in Figure 2.
By adding these relational paths as features to
the model, we can learn interesting relational pat-
terns that may have low precision (e.g. “people are
likely to be friends with their classmates”) without
hampering extraction performance. This is in con-
trast to the system described in Nahm and Mooney
(2000), in which patterns are induced from a noisy
database and then applied directly to extraction. In
our system, since each learned path has an associ-
ated weight, it is simply another piece of evidence
to help the extractor. Low precision patterns may
have lower weights than high precision patterns, but
they will still influence the extractor.
A nice property of this approach is that examin-
inghighlyweightedpatternscanprovideinsightinto
regularities of the data.
4.1 Feature Induction
During CRF training, weights are learned for each
relational pattern. Patterns that increase extraction
performance will receive higher weights, while pat-
terns that have little effect on performance will re-
ceive low weights.
We can explore the space of possible conjunctions
of these patterns using feature induction for CRFs,
as described in McCallum (2003). Search through
the large space of possible conjunctions is guided
298
by adding features that are estimated to increase the
likelihood function most.
When feature induction is used with relational
patterns, we can view this as a type of data mining,
in which patterns are created based on their influ-
ence on an extraction model. This is similar to work
by Dehaspe (1997), where inductive logic program-
ming is embedded as a feature induction technique
for a maximum entropy classifier. Our work restricts
induced features to conjunctions of base features,
rather than using first-order clauses. However, the
patterns we learn are based on information extracted
from natural language.
4.2 Iterative Database Construction
The top-down knowledge provided by data min-
ing algorithms has the potential to improve the per-
formance of information extraction systems. Con-
versely, bottom-up knowledge generated by ex-
traction systems can be used to populate a large
database, from which more top-down knowledge
can be discovered. By carefully communicating the
uncertainty between these systems, we hope to iter-
atively expand a knowledge base, while minimizing
fallacious inferences.
In this work, the top-down knowledge consists of
relational patterns describing the database path be-
tween entities in text. The uncertainty of this knowl-
edge is handled by associating a real-valued CRF
weight with each pattern, which increases when the
pattern is predictive of other relations. Thus, the ex-
traction model can adapt to noise in these patterns.
Since we also desire to extract relations between
entities that appear in text but not in the database, we
first populate the database with relations extracted
by a CRF that does not use relational patterns. We
then do further extraction with a CRF that incorpo-
rates the relational patterns found in this automati-
cally generated database. In this manner, we create a
closed-loop system that alternates between bottom-
up extraction and top-down pattern discovery. This
approach can be viewed as a type of alternating opti-
mization, with analogies to formal methods such as
expectation-maximization.
The uncertainty in the bottom-up extraction step
is handled by estimating the confidence of each ex-
traction and pruning the database to remove en-
tries with low confidence. One of the benefits of
a probabilistic extraction model is that confidence
estimates can be straight-forwardly obtained. Cu-
lotta and McCallum (2004) describe the constrained
forward-backward algorithm to efficiently estimate
the conditional probability that a segment of text is
correctly extracted by a CRF.
Using this algorithm, we associate a confidence
value with each relation extracted by the CRF. This
confidence value is then used to limit the noise
introduced by incorrect extractions. This differs
from Nahm and Mooney (2000) and Mooney and
Bunescu(2005), inwhichstandarddecisiontreerule
learners are applied to the unfiltered output of ex-
traction.
4.3 Extracting Implicit Relations
An implicit relation is one that does not have direct
contextual evidence, for example the cousin relation
in our initial example. Implicit relations generally
require some background knowledge to be detected,
such as relational patterns (e.g. rules about familial
relations). These are the sorts of relations on which
current extraction models perform most poorly.
Notably, these are exactly the sorts of relations
thatarelikelytohavethebiggestimpactoninforma-
tion access. A system that can accurately discover
knowledge that is only implied by the text will dra-
matically increase the amount of information a user
can uncover, effectively providing access to the im-
plications of a corpus.
We argue that integrating top-down and bottom-
up knowledge discovery algorithms discussed in
Section 4.2 can enable this technology. By per-
forming pattern discovery in conjunction with infor-
mation extraction, we can collate facts from multi-
ple sources to infer new relations. This is an ex-
ample of cross-document fusion or cross-document
information extraction, a growing area of research
transforming raw extractions into usable knowledge
bases (Mann and Yarowsky, 2005; Masterson and
Kushmerik, 2003).
5 Experiments
5.1 Data
We sampled 1127 paragraphs from 271 articles from
theonlineencyclopediaWikipedia1 andlabeledato-
1http://www.wikipedia.org
299
George W. Bush
Dick Cheney
underling
Yale
education
Republican
partyPresident
jobTitle
George H. W. Bush
son
underlingHarken Energy
executive
education party
jobTitle
Prescott Bush
son
education
Bill Clinton
rival
Bob Dole
rival
education
Democrat
party
jobTitle
Hillary Clinton
husband
education
party
Halliburton
executiveeducation
Pres Medal of Freedom
awardparty
Nelson Rockefeller
award
Elizabeth Dole
wife
WWII
participant
awardparty
party
Martin Luther King, Jr.
award
Figure 3: An example of the connectivity of the entities in the data.
birthday birth year death day
death year nationality visited
birth place death place religion
job title member of cousin
friend discovered education
employer associate opus
participant influence award
brother wife supported idea
executive of political party supported person
founder son father
rival underling superior
role inventor husband
grandfather sister brother-in-law
nephew mother daughter
granddaughter grandson great-grandson
grandmother rival organization owner of
uncle descendant ancestor
great-grandfather aunt
Table 1: The set of labeled relations.
tal of 4701 relation instances. In addition to a large
set of person-to-person relations, we also included
links between people and organizations, as well as
biographical facts such as birthday and jobTitle. In
all, there are 53 labels in the training data (Table 1).
We sample articles that result in a high density
of interesting relations by choosing, for example, a
collection of related family members and associates.
Figure 3 shows a small example of the type of con-
nections in the data. We then split the data into train-
ing and testing sets (70-30 split), attempting to sep-
arate the entities into connected components. For
example, all Bush family members were placed in
the training set, while all Kennedy family members
were placed in the testing set. While there are still
occasional paths connecting entities in the training
set to those in the test set, we believe this method-
ology reflects a typical real-world scenario in which
we would like to extend an existing database to a
different, but slightly related, domain.
The structure of the Wikipedia articles somewhat
simplifies the extraction task, since important enti-
ties are hyper-linked within the text. This provides
an automated way to detect entities in the text, al-
though these entities are not classified by type. This
also allows us to easily construct database queries,
since we can reason at the entity level, rather than
the token level. (Although, see Sarawagi and Cohen
(2004) for extensions of CRFs that model the en-
tity length distribution.) The results we report here
are constrained to predict relations only for hyper-
linked entities. Note that despite this property, we
still desire to use a sequence model to capture the
dependencies between adjacent labels.
We use the MALLET CRF implementation (Mc-
Callum, 2002) with the default regularization pa-
rameters.
Basedoninitialexperiments,werestrictrelational
path features to length two or three. Paths of length
one will learn trivial paths and can lead to over-
fitting. Paths longer than three can increase compu-
tationalcostswithoutaddingmuchnewinformation.
In addition to the relational pattern features de-
scribed in Section 4, the list of local features in-
cludes context words (such as the token identity
within a 6 word window of the target token), lexi-
cons (such as whether a token appears in a list of
cities, people, or companies), regular expressions
(such as whether the token is capitalized or contains
digits or punctuation), part-of-speech (predicted by
a CRF that was trained separately for part of speech
tagging), prefix/suffix (such as whether a word ends
in -ed or begins with ch-), and offset conjunctions
(combinations of adjacent features within a window
of size six).
300
ME CRF0 CRFr CRFr0.9 CRFr0.5 CRFt CRFt0.5
F1 .5489 .5995 .6100 .6008 .6136 .6791 .6363
P .6475 .7019 .6799 .7177 .7095 .7553 .7343
R .4763 .5232 .5531 .5166 .5406 .6169 .5614
Table 2: Results comparing the relative benefits of using relational patterns in extraction.
5.2 Extraction Results
We evaluate performance by calculating the preci-
sion (P) and recall (R) of extracted relations, as well
as the F1 measure, which is the harmonic mean of
precision and recall.
CRF0 is the conditional random field constructed
without relational features. Results for CRF0 are
displayed in the second column of Table 2. ME is
a maximum entropy classifier trained on the same
feature set as CRF0. The difference between these
two models is that CRF0 models the dependence of
relations that appear consecutively in the text. The
superior performance of CRF0 suggests that this de-
pendence is important to capture.
The remaining models incorporate the relational
patterns described in Section 4. We compare three
different confidence thresholds for the construction
of the initial testing database, as described in Sec-
tion 4.2. CRFr uses no threshold, while CRFr0.9
andCRFr0.5restrictthedatabasetoextractionswith
confidence greater than 0.9 and 0.5, respectively.
As shown by comparing CRF0 and CRFr in Ta-
ble 2, the relational features constructed from the
database with no confidence threshold provides a
considerable boost in recall (reducing error by 7%),
at the cost of a decrease in precision. Here we see
the effect of making fallacious inferences on a noisy
database.
In column four, we see the opposite effect for
the overly conservative threshold of CRFr0.9. Here,
precision improves slightly over CRF0, and consid-
erably over CRFr (12% error reduction), but this is
accompanied by a drop in recall (8% reduction).
Finally, in column five, a confidence of 0.5 results
in the best F1 measure (a 3.5% error reduction over
CRF0). CRFr0.5 alsoobtainsbetterrecallandpreci-
sion than CRF0, reducing recall error by 3.6%, pre-
cision error by 2.5%.
Comparing the performance on different relation
types, we find that the biggest increase from CRF0
to CRFr0.5 is on the memberOf relation, for which
the F1 score improves from 0.4211 to 0.6093. We
conjecture that the reason for this is that the patterns
mostusefulforthememberOflabelcontainrelations
that are well-detected by the first-pass CRF. Also,
thelocallanguagecontextseemsinadequatetoprop-
erly extract this relation, given the low performance
of CRF0.
To better gauge how much relational pattern fea-
tures are affected by errors in the database, we run
two additional experiments for which the relational
features are fixed to be correct. That is, imagine that
we construct a database from the true labeling of the
testingdata, andcreatetherelationalpatternfeatures
from this database. Note that this does not trivialize
the problem, since there are no relational path fea-
tures of length one (e.g., if X is the wife of Y, there
will be no feature indicating this).
We construct two experiments under this scheme,
one where the entire test database is used (CRFt),
and another where only half the relations are in-
cluded in the test database, selected uniformly at
random (CRFt0.5).
Column six shows the improvements enabled by
using the complete testing database. More inter-
estingly, column seven shows that even with only
half the database accurately known, performance
improves considerably over both CRF and CRFr0.5.
ArealisticscenarioforCRFt0.5isasemi-automated
system, in which apartially-filled databaseis usedto
bootstrap extraction.
5.3 Mining Results
Comparing the impact of discovered patterns on ex-
traction is a way to objectively measure mining per-
formance. We now give a brief subjective evaluation
of the learned patterns. By examining relational pat-
terns with high weights for a particular label, we can
glean some regularities from our dataset. Examples
of such patterns are in Table 3.
301
Relation Relational Path Feature
mother father → wife
cousin mother → husband → nephew
friend education → student
education father → education
boss boss → son
memberOf grandfather → memberOf
rival politicalParty → member → rival
Table3: Examplesofhighlyweightedrelationalpat-
terns.
Fromthefamilialrelationsinourtrainingdata,we
are able to discover many equivalences for mothers,
cousins, grandfathers, and husbands. In addition to
these high precision patterns, the system also gener-
ates interesting, low precision patterns. Row 3-7 of
Table 3 can be summarized by the following gener-
alizations: friends tend to be classmates; children of
alumni often attend the same school as their parents;
a boss’ child often becomes the boss; grandchildren
are often members of the same organizations as their
grandparents; and rivals of a person from one polit-
ical party are often rivals of other members of the
same political party. While many of these patterns
reflectthehighconcentrationofpoliticalentitiesand
familial relations in our training database, many will
have applicability across domains.
5.4 Implicit Relations
It is difficult to measure system performance on im-
plicit relations, since our labeled data does not dis-
tinguish between explicit and implicit relations. Ad-
ditionally, accurately labeling all implicit relations
is challenging even for a human annotator.
We perform a simple exploratory analysis to de-
termine how relational patterns can help discover
implicit relations. We construct a small set of syn-
thetic sentences for which CRF0 successfully ex-
tracts relations using contextual features. We then
add sentences with slightly more ambiguous lan-
guageandmeasurewhetherCRFr canovercomethis
ambiguity using relational pattern features.
For example, we create an article about an en-
tity named “Bob Smith” that includes the sentences
“His brother, Bill Smith, was a biologist” and “His
companion, Bill Smith, was a biologist.” CRF0 suc-
cessfully returns the brother relation in the first sen-
tence, but not the second. After a fact is added to
the database that says Bob and Bill have a brother in
commonnamedJohn, CRFr isabletocorrectlylabel
the second sentence in spite of the ambiguous word
“companion,” because CRF0 has a highly-weighted
relational pattern feature for brother.
Similar behavior is observed for low precision
patterns like “associates tend to win the same
awards.” A synthetic article for the entity “Tom
Jones” contains the sentences “He was awarded the
Pulitzer Prize in 1998” and “Tom got the Pulitzer
Prize in 1998.” Because CRF0 is highly-reliant on
the presence of the verb “awarded” or “won” to indi-
cate a prize fact, it fails to label the second sentence
correctly. Afterthedatabaseisaugmentedtoinclude
thefactthatTom’sassociateJillreceivedthePulitzer
Prize, CRFr labels the second sentence correctly.
However, we also observed that CRFr still re-
quires some contextual clues to extract implicit re-
lations. For example, if the Tom Jones article in-
stead contains the sentence “The Pulitzer Prize was
awarded to him in 1998,” neither CRF labels the
prize fact correctly, since this passive construction
is rarely seen in the training data.
We conclude from this brief analysis that rela-
tional patterns used by CRFr can help extract im-
plicit relations when (1) the database contains ac-
curate relational information, and (2) the sentence
contains limited contextual clues. Since relational
patterns are treated only as additional features by
CRFr, they are generally not powerful enough to
overcome a complete absence of contextual clues.
Fromthisperspective,relationalpatternscanbeseen
as enhancing the signal from contextual clues. This
differs from deterministically applying learned rules
independent of context, which may boost recall at
the cost of precision.
6 Conclusions and Future Work
We have shown that integrating pattern discovery
with relation extraction can lead to improved per-
formance on each task.
In the future, we wish to explore extending this
methods to larger datasets, where we expect rela-
tional patterns to be even more interesting. Also,
weplantoimproveuponiterativedatabaseconstruc-
tion by performing joint inference among distant
302
relations in an article. Inference in these highly-
connected models will likely require approximate
methods. Additionally, we wish to focus on extract-
ing implicit relations, dealing more formally with
the precision-recall trade-off inherent in applying
noisy rules to improve extraction.
7 Acknowledgments
Thanks to the Google internship program, and to Charles Sutton
for providing the CRF POS tagger. This work was supported in
part by the Center for Intelligent Information Retrieval, in part
by U.S. Government contract #NBCH040171 through a sub-
contract with BBNT Solutions LLC, in part by The Central In-
telligence Agency, the National Security Agency and National
Science Foundation under NSF grant #IIS-0326249, and in part
by the Defense Advanced Research Projects Agency (DARPA),
through the Department of the Interior, NBC, Acquisition Ser-
vices Division, under contract number NBCHD030010. Any
opinions, findings and conclusions or recommendations ex-
pressed in this material are the author(s) and do not necessarily
reflect those of the sponsor.
References
Eugene Agichtein and Luis Gravano. 2000. Snowball: Extract-
ing relations from large plain-text collections. In Proceed-
ings of the Fifth ACM International Conference on Digital
Libraries.
Sergey Brin. 1998. Extracting patterns and relations from the
world wide web. In WebDB Workshop at 6th International
Conference on Extending Database Technology.
Razvan Bunescu and Raymond Mooney. 2006. Subsequence
kernels for relation extraction. In Y. Weiss, B. Sch¨olkopf,
and J. Platt, editors, Advances in Neural Information Pro-
cessing Systems 18. MIT Press, Cambridge, MA.
Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew K. Mc-
Callum, Tom M. Mitchell, Kamal Nigam, and Se´an Slattery.
1998. Learning to extract symbolic knowledge from the
World Wide Web. In Proceedings of AAAI-98, 15th Confer-
ence of the American Association for Artificial Intelligence,
pages 509–516, Madison, US. AAAI Press, Menlo Park, US.
Aron Culotta and Andrew McCallum. 2004. Confidence es-
timation for information extraction. In Human Langauge
Technology Conference (HLT 2004), Boston, MA.
Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree
kernels for relation extraction. In ACL.
L. Dehaspe. 1997. Maximum entropy modeling with clausal
constraints. In Proceedings of the Seventh International
Workshop on Inductive Logic Programming, pages 109–125,
Prague, Czech Republic.
M. Hearst. 1999. Untangling text data mining. In 37th Annual
Meeting of the Association for Computational Linguistics.
Nanda Kambhatla. 2004. Combining lexical, syntactic, and se-
mantic features with maximum entropy models for extract-
ing relations. In ACL.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001.
Conditional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proc. 18th Interna-
tional Conf. on Machine Learning, pages 282–289. Morgan
Kaufmann, San Francisco, CA.
Gideon Mann and David Yarowsky. 2005. Multi-field informa-
tion extraction and cross-document fusion. In ACL.
D. Masterson and N. Kushmerik. 2003. Information extraction
from multi-document threads. In ECML-2003: Workshop on
Adaptive Text Extraction and Mining, pages 34–41.
Andrew McCallum and David Jensen. 2003. A note on the
unification of information extraction and data mining us-
ing conditional-probability, relational models. In IJCAI03
Workshop on Learning Statistical Models from Relational
Data.
Andrew McCallum. 2002. Mallet: A machine learning for
language toolkit. http://mallet.cs.umass.edu.
Andrew McCallum. 2003. Efficiently inducing features of con-
ditional random fields. In Nineteenth Conference on Uncer-
tainty in Artificial Intelligence (UAI03).
Scott Miller, Heidi Fox, Lance A. Ramshaw, and Ralph
Weischedel. 2000. A novel use of statistical parsing to ex-
tract information from text. In ANLP.
Raymond J. Mooney and Razvan Bunescu. 2005. Mining
knowledge from text using information extraction. SigKDD
ExplorationsonTextMiningandNaturalLanguageProcess-
ing.
Un Yong Nahm and Raymond J. Mooney. 2000. A mutually
beneficial integration of data mining and information extrac-
tion. In AAAI/IAAI.
Bradley L. Richards and Raymond J. Mooney. 1992. Learning
relations by pathfinding. In Proceedings of the Tenth Na-
tionalConferenceonArtificialIntelligence(AAAI-92), pages
50–55, San Jose, CA.
Dan Roth and Wen tau Yih. 2002. Probabilistic reasoning for
entity and relation recognition. In COLING.
Sunita Sarawagi and William W. Cohen. 2004. Semi-markov
conditional random fields for information extraction. In
NIPS 04.
Charles Sutton and Andrew McCallum. 2004. Dynamic condi-
tional random fields: Factorized probabilistic models for la-
beling and segmenting sequence data. In Proceedings of the
Twenty-First International Conference on Machine Learning
(ICML).
Charles Sutton and Andrew McCallum. 2006. An introduction
to conditional random fields for relational learning. In Lise
Getoor and Ben Taskar, editors, Introduction to Statistical
Relational Learning. MIT Press. To appear.
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.
2003. Kernel methods for relation extraction. Journal of
Machine Learning Research, 3:1083–1106.
303
