Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 57–64, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Redundancy-based Correction of Automatically Extracted Facts
Roman Yangarber and Lauri Jokipii
Department of Computer Science
University of Helsinki, Finland
first.last@cs.helsinki.fi
Abstract
The accuracy of event extraction is lim-
ited by a number of complicating factors,
with errors compounded at all stages
inside the Information Extraction pipeline.
In this paper, we present methods for re-
covering automatically from errors com-
mitted in the pipeline processing. Recov-
ery is achieved via post-processing facts
aggregated over a large collection of doc-
uments, and suggesting corrections based
on evidence external to the document. A
further improvement is derived from prop-
agating multiple, locally non-best slot fills
through the pipeline. Evaluation shows
that the global analysis is over 10 times
more likely to suggest valid corrections to
the local-only analysis than it is to suggest
erroneous ones. This yields a substantial
overall gain, with no supervised training.
1 Introduction
Information Extraction (IE) is a technology for find-
ing facts in plain text, and coding them in a logical
representation, such as a relational database.
IE is typically viewed and implemented as a se-
quence of stages—a “pipeline”:
1. Layout, tokenization, lexical analysis
2. Name recognition and classification
3. Syntactic parsing (commonly shallow)
4. Resolution of co-reference among entities
5. Pattern-based event matching and role mapping
6. Normalization and output generation
While accuracy at the lowest levels can reach the high
90s, as the stages advance, complexity increases
and performance degrades considerably.
The problem of IE as a whole, as well as each of
the listed subproblems, has been studied intensively
for well over a decade, in many flavors and varieties.
Key observations about much state-of-the-art IE are:
a. IE is typically performed by a pipeline process;
b. Only one hypothesis is propagated through the
pipeline for each fact—the “best guess” the
system can make for each slot fill;
c. IE is performed in a document-by-document
fashion, applying a priori knowledge locally to
each document.
The a priori knowledge may be encoded in a set of
rules, an automatically trained model, or a hybrid
thereof. Information extracted from documents—
which may be termed a posteriori knowledge—
is usually not reused across document boundaries,
because the extracted facts are imprecise, and are
therefore not a reliable basis for future extraction.
Furthermore, locally non-best slot fills are not
propagated through the pipeline, and are conse-
quently not available downstream, nor for any global
analysis.
In most systems, these stages are performed in se-
quence. The locally-best slot fills are passed from
the “lower-” to the “higher-level” modules, with-
out feedback. Improvements are usually sought
(e.g., the ACE research programme, (ACE, 2004))
by boosting performance at the lower levels, to reap
benefits in the subsequent stages, where fewer errors
are propagated.
The point of departure for this paper is: the
IE process is noisy and imprecise at the single-
document level; this has been the case for some time,
and though there is much active research in the area,
the situation is not likely to change radically in the
immediate future—rather, we can expect slow, in-
cremental improvements over some years.
In our experiments, we approach the performance
problem from the opposite end: start with the ex-
tracted results and see if the totality of a posteri-
ori knowledge about the domain—knowledge gen-
erated by the same noisy process we are trying to
improve—can help recover from errors that stem
from locally insufficient a priori knowledge.
The aim of the research presented in this paper
is to improve performance by aggregating related
facts, which were extracted from a large document
collection, and to examine to what extent the cor-
rectly extracted facts can help correct those that were
extracted erroneously.
The rest of the paper is organized as follows. Sec-
tion 2 contains a brief review of relevant prior work.
Section 3 presents the experimental setup: the text
corpus, the IE process, the extracted facts, and what
aspects of the extracted facts we try to improve
in this paper. Section 4 presents the methods for im-
proving the quality of the data using global analysis,
starting with a naive, baseline method, and proceed-
ing with several extensions. Each method is then
evaluated, and the results are examined in section 5.
In section 6, we present further extensions currently
under research, followed by the conclusion.
2 Prior Work
As we stated in the introduction, typical IE sys-
tems consist of modules arranged in a cascade, or
a pipeline. The modules themselves may be based
on heuristic rules or automatically trained; there is
an abundance of approaches in both camps (and
everywhere in between) to each of the pipeline stages
listed in the introduction.
It is our view that to improve performance we
ought to depart from the traditional linear, pipeline-
style design. This view is shared by others in the
research community; the potential benefits have pre-
viously been recognized in several contexts.
In (Nahm and Mooney, 2000a; Nahm and
Mooney, 2000b), it was shown that learning rules
from a fact base, extracted from a corpus of job post-
ings for computer programmers, improves future ex-
traction, even though the originally extracted facts
themselves are far from error-free. The idea is to
mine the data base for association rules, and then to
integrate these rules into the extraction process.
The baseline system is obtained by supervised
learning from a few hundred manually annotated ex-
amples. Then the IE system is applied to succes-
sively larger sets of unlabeled examples, and associ-
ation rules are mined from the extracted facts. The
resulting combined system (trained model plus as-
sociation rules) showed an improvement in perfor-
mance on a test set, which correlated with the size
of the unlabeled corpus.
In work on improving (Chinese) named entity tagging,
(Ji and Grishman, 2004; Ji and Grishman, 2005)
show benefits to this component from integrating
decisions made in later stages, viz. co-reference
and relation extraction.1
Tighter coupling and integration between IE and
KDD components for mutual benefit is advocated by
(McCallum and Jensen, 2003), who present models
based on CRFs and supervised training.
This work is related in spirit to the work presented
in this paper, in its focus on leveraging cross-document
information (though it is inherently noisy)
to improve local decisions. We
expect that the approach could be quite powerful
when these ideas are used in combination, and our
experiments seem to confirm this expectation.
3 Experimental Setup
In this section we describe the text corpus, the un-
derlying IE process, the form of the extracted facts,
and the specific problem under study—i.e., which
aspects of these facts we first try to improve.
1Performance on English named entity tasks reaches mid to
high 90’s in many domains.
3.1 Corpus
We conducted experiments with redundancy-based
auto-correction over a large database of facts ex-
tracted from the texts in ProMED-Mail, a mailing
list which carries reports about outbreaks of infec-
tious epidemics around the world and the efforts
to contain them. This domain has been explored
earlier; see, e.g., (Grishman et al., 2003) for an
overview.
Our underlying IE system is described in (Yangarber
et al., 2005). The system comprises a hybrid
automatically- and manually-built pattern base for
finding facts, an HMM-based name tagger, an automatically
compiled and manually verified domain-specific
ontology, based in part on MeSH (MeSH, 2004),
and a rule-based co-reference module that
uses the ontology.
The database is live on-line, and is continuously
updated with new incoming reports; it can be ac-
cessed at doremi.cs.helsinki.fi/plus/.
Text reports have been collected by ProMED-
Mail for over 10 years. The quality of reporting (and
editing) has been rising over time, which is easy to
observe in the text data. The distribution of the data,
aggregated by month, is shown in Figure 1, where
one can see a steady increase in volume over time.2
3.2 Extracted Facts
We now describe the makeup of the data extracted
from text by the IE process, with basic terminology.
Each document in the corpus contains a single report,
which may contain one or more stories. Story
breaks are indicated by layout features, and are ex-
tracted by heuristic rules, tuned for this domain and
corpus. When processing a multi-story report, the
IE system treats each story as a separate document;
no information is shared among stories, except that
the text of the main headline of a multi-story report
is available to each story.3
Since outbreaks may be described in complex
ways, it is not obvious how to represent a single fact
in this context. To break down this problem, we use
the notion of an incident. Each story may contain
multiple outbreak-related incidents/facts.4
2This is beneficial to the IE process, which operates better
with formulaic, well-edited text.
3The format of the documents in the archive can be examined
by browsing the source site www.promedmail.org.
[Plot: Count (axis from 0 to 1000) against Date (1995 to 2005), with three series: Records (46,317), Stories (30,015), Documents (22,560)]
Figure 1: Distribution of data in ProMED-Mail
We analyze an outbreak as a series of incidents.
The incidents may give “redundant” information
about an outbreak, e.g., by covering overlapping
time intervals or geographic areas. For example, a
report may first state the number of cases within the
last month, and then give the total for the entire year.
We treat each of these statements as a separate inci-
dent; the containment relations among them are be-
yond the scope of our current goals.5
Thus each incident corresponds to a partial de-
scription of an outbreak, over a period of time and
geographic area. This makes it easy to represent
each incident/fact as a separate row in the table.
The key fields of the incident table are:
a0 Disease Name
a0 Location
a0 Date (start and end)
Where possible, the system also extracts informa-
tion about the victims affected in the incident—their
count, severity (affected or dead), and a descriptor
(people, animals, etc.). The system also extracts
bookkeeping information about each incident: loca-
tions of mentions of the key fields in the text, etc.
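For concreteness, one row of the incident table might be modelled as in the sketch below. It is only an illustration of the fields listed above: the class name `Incident`, the field names, and the sample values are assumptions made here, not the schema of the actual database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """One row of the incident table (hypothetical field names)."""
    disease: str                              # key field: disease name
    location: str                             # key field: location as found in the text
    date_start: Optional[str] = None          # key field: start date, may be missing
    date_end: Optional[str] = None            # key field: end date
    victim_count: Optional[int] = None        # number of victims, if stated
    victim_severity: Optional[str] = None     # "affected" or "dead"
    victim_descriptor: Optional[str] = None   # e.g. "people", "animals"


# Made-up sample row, for illustration only.
row = Incident(disease="avian influenza", location="Hanoi",
               date_end="2004.01.15", victim_count=3,
               victim_severity="dead", victim_descriptor="people")
```

Optional fields default to `None`, which mirrors the remark that victim information is extracted only where possible.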
The system’s performance is currently at 71.16 F-
measure: 67% recall, 74% precision. This score is
obtained by a MUC scorer (Douthat, 1998) on a 50-
document test corpus, which was manually tagged
with correct incidents with these slots.
4In this paper, we use the terms fact, incident, and event
interchangeably.
5This problem is addressed in, e.g., (Huttunen et al., 2002).
We have no blind-test corpus at present, but prior experience
suggests that we ought to expect about a 10% reduc-
tion in F-measure on unseen data; this is approxi-
mately borne out by our informal evaluations.
Further, the system attempts to “normalize” the
key fields. An alias for a disease name (e.g., “bird
flu”) is mapped to a canonical name (“avian in-
fluenza.”)6 Date expressions are normalized to a
standard format yyyy.mm.dd–yyyy.mm.dd.7
Note that the system may not be able to normalize
some entities, which then remain un-normalized.
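The date format described above can be sketched as follows. The function name `format_interval` is hypothetical, and rendering a missing start date as an empty string is an assumption made here; the text only says that some intervals lack a starting date.

```python
from datetime import date
from typing import Optional


def format_interval(start: Optional[date], end: date) -> str:
    """Render an interval as yyyy.mm.dd–yyyy.mm.dd (assumed: unknown start left blank)."""
    def fmt(d: Optional[date]) -> str:
        return d.strftime("%Y.%m.%d") if d is not None else ""
    return f"{fmt(start)}–{fmt(end)}"
```

Keeping every interval in one fixed-width textual form is what makes the later aggregation over the fact base straightforward.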
Such normalization is clearly helpful for search-
ing, but it is not only a user-interface issue. Normal-
izing reduces sparseness of data; and since our intent
is to aggregate related facts across a large fact base,
excessive variation in the database fields would re-
duce the effectiveness of the proposed methods.
3.3 Experimental Focus: Location
Normalization
A more complex problem arises out of the need to
normalize location names. For each record, we nor-
malize the location field—which may be a name of
a small village or a larger area—by relating it to the
name of the containing country; we also decided to
map locations in the United States to the name of the
containing state, (rather than the name of the coun-
try, “USA”).8 This mapping will be henceforth re-
ferred to as “location–state,” for short. The ideas
presented in the introduction are explored in the re-
mainder of this paper in the context of correcting the
location–state mapping.
Section 6 will touch upon our current work on ex-
tending the methodology to slots other than state.
(Please see Section 5 for further justification of this
choice for our initial experiments.)
To make the experiments interesting and fair, we
kept the size of the gazetteer small. The a priori geo-
graphic knowledge base contains names of countries
of the world (270), with aliases for several of them; a
list of capitals and other selected major cities (300);
a list of states in the USA and acronyms (50); major
US cities (100); names of the (sub)continents (10),
and oceans. In our current implementation, continents
are treated semantically as “states” as well.9
6This is done by means of a set of scenario-specific patterns
and a dictionary of about 2500 disease names with aliases.
7Some date intervals may not have a starting date, e.g., if the
text states “As of last Tuesday, the victim count is N...”
8This decision was made because otherwise records with
state = USA strongly skew the data, and complicate learning.
The IE system operates in a local, document-by-
document fashion. Upon encountering a location
name that is not in its dictionaries, the system has
two ways to map it to the state name. One way is
by matching patterns over the immediate local con-
text, (“Milan, Italy”). Failing that, it tries to find
the corresponding state by positing an “underspeci-
fied” state name (as if referred to by a kind of spe-
cial “pronoun”) and mapping the location name to
that. The reference resolution module then finds the
most likely antecedent entity, of the semantic type
“state/country,” where likelihood is determined by
its proximity to the mention of the location name.
Note that the IE system outputs only a single, best
hypothesis for the state fill for each record.
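The two-step local mapping described above can be sketched roughly as below. This is not the system's actual code: the function name `local_state`, the use of character offsets, and the use of absolute distance as the measure of proximity are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def local_state(location: str, loc_pos: int, text: str,
                state_mentions: List[Tuple[int, str]],
                known_states: Iterable[str]) -> Optional[str]:
    """Return a single best-guess state for one location mention.

    state_mentions holds (character offset, state name) pairs found in the story.
    """
    # Step 1: a pattern over the immediate local context, such as "Milan, Italy".
    for state in sorted(known_states):
        if f"{location}, {state}" in text:
            return state
    # Step 2: otherwise pick the state mention closest to the location mention
    # (assumed here: absolute character distance stands in for proximity).
    if not state_mentions:
        return None
    nearest = min(state_mentions, key=lambda m: abs(m[0] - loc_pos))
    return nearest[1]
```

Only one state is returned per mention, matching the single-hypothesis behaviour noted above.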
3.4 The Data
The database currently contains about 46,317 individual
facts/incidents, extracted from 30,015 stories,
from 22,560 reports (cf. Fig. 1). Each incident
has a location and a state filler. We say a location
name is “ambiguous” if it appears in the location slot
of at least two records, which have different names
in the state slot. The number of distinct “ambiguous”
location names is 1,020.
Note, this terminology is a bit sloppy: the fillers
to which we refer as “ambiguous location names”,
may not be valid location names at all; they may
simply be errors in the IE process. E.g., at the name
classification stage, a disease name (especially if not
in the disease dictionary) may be misclassified, and
used as a filler for the location slot.
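The definition of an “ambiguous” location name can be stated compactly in code. The sketch below assumes that each record is reduced to a (location fill, state fill) pair; the function name `ambiguous_locations` and the toy records in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def ambiguous_locations(records: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return location fills that occur with at least two different state fills."""
    states_by_location = defaultdict(set)
    for location, state in records:
        states_by_location[location].add(state)
    return {loc for loc, states in states_by_location.items() if len(states) >= 2}
```

For example, given the toy records ("Athens", "Greece"), ("Athens", "Georgia"), and ("Hanoi", "Vietnam"), only "Athens" is returned. As noted above, a fill flagged this way need not be a valid location name at all.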
We further group together the location fills by
stripping lower-case words that are not part of the
proper name, from the front and the end of the fill.
E.g., we group together “southern Mumbai” and
“the Mumbai area,” as referring to the same name.
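A minimal sketch of this grouping step follows, assuming that “lower-case words” can be detected with a simple case test on whitespace-separated tokens; the function name `trim_location` is hypothetical.

```python
def trim_location(fill: str) -> str:
    """Strip lower-case words from the front and the end of a location fill."""
    words = fill.split()
    while words and words[0].islower():
        words.pop(0)          # drop leading words such as "southern" or "the"
    while words and words[-1].islower():
        words.pop()           # drop trailing words such as "area"
    return " ".join(words)
```

Only the ends are trimmed, so a lower-case word inside a proper name (as in "Rio de Janeiro") is kept.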
After grouping and trimming insignificant words,
the number of distinct ambiguous names appearing in location
fills is 600, which covers a total of 6583 records,
or 14.2% of all extracted facts. As an estimate of
the potential for erroneous mapping from locations
to states, this is quite high, about 1 in 7 records.10
9By the same token, both Connecticut and USA are “states.”
10Of course, it can be higher as well, if the IE system consistently
maps some location name to the same wrong
state; these cases are below the radar of our scheme, in which
the starting point is the “ambiguous” locations.
4 Experiments and Results
We now present the methods of correcting possible
errors in the location–state relation. A method $M$
tries to suggest a new value for the state fill for every
incident $I$ that contains an ambiguous location fill:
$$\mathit{new\_state}(I) = \arg\max_{s \in S_M(I)} \mathit{score}_M(s, I) \qquad (1)$$
where $S_M(I)$ is a set of all candidate states considered
by $M$ for $I$; $\mathit{score}_M(s, I)$ is the scoring function
specific to $M$. The method chooses the candidate
state which maximizes the score.
For each method below, we discuss how $S_M$ and
$\mathit{score}_M$ are constructed.
4.1 Baseline: Raw Majority
We begin with a simple recovery approach. We simply
assume that the correct state for an ambiguous
location name is the state most frequently associated
with it in the database. We denote by $D$ the set of all
incidents in the database. For an incident $I \in D$, we
write $l, s \in I$ when location $l$, state $s$, etc., “belong”
to $I$, i.e., are extracted as fills in $I$. In the baseline
method, $B$, for each incident $I$ where $l \in I$ is one of
the 600 ambiguous location names, we define:
$$S_B(I) = \{\, s' : \exists I' \in D,\ (l, s') \in I' \,\}$$
$$\mathit{score}_B(s', I) = |\{\, I' \in D : (l, s') \in I' \,\}|$$
i.e., $s'$ is a candidate if it is a state fill in some incident
whose location fill is also $l$; the score is the
number of times the pair $(l, s')$ appears together in
some incident in $D$. The majority winner is then
suggested as the “correct” state, for every record
containing $l$. By “majority” winner we mean the
candidate with the maximal count, rather than a state
with more than half of the votes. When the candidates
tie for first place, no suggestions are made,
although it is quite likely that some of the records
carrying $l$ will have incorrect state fills.
A manual evaluation of the performance of this
method is shown in Table 1, the Baseline column.
The first row shows for how many records this
method suggested a change from the original, IE-filled
state. The baseline changed 858 incidents.
This constitutes about 13% out of the maximum
number of changeable records, 6583.
Thus, this number represents the volume of the
potential gain or loss from the global analysis: the
proportion of records that actually get modified.
The remaining records were unchanged, either be-
cause the majority winner coincides with the origi-
nal IE-extracted state, or because there was a tie for
the top score, so no decision could be made.
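The raw-majority baseline of Section 4.1 can be sketched in a few lines. The sketch assumes that records are reduced to (location, state) pairs; the function name `raw_majority` and the toy data in the usage note are hypothetical, and this is an illustration rather than the code used in the experiments.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def raw_majority(records: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each location to its most frequent state, or to None on a tie."""
    counts = defaultdict(Counter)
    for location, state in records:
        counts[location][state] += 1          # score = co-occurrence count in the database
    winners: Dict[str, Optional[str]] = {}
    for location, counter in counts.items():
        ranked = counter.most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        winners[location] = None if tie else ranked[0][0]   # tie: no suggestion is made
    return winners
```

With three toy records pairing "Athens" with "Greece" and one pairing it with "Georgia", the suggestion for "Athens" is "Greece". A record is counted as changed only when the winner differs from its original state fill.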
We manually verified a substantial sample of the
modified records. When verifying the changes, we
referred back to the text of the incident, and, when
necessary, consulted further geographical sources to
determine exactly whether the change was correct in
each case.
For the baseline we had manually verified 27.7%
of the changes. Of these, 68.5% were a clear gain:
an incorrect state was changed to a correct state.
6.3% were a clear loss, a correct state lost to an in-
correct one. This produces quite a high baseline, sur-
prisingly difficult to beat.
The next two rows represent the “grey” areas.
These are records which were difficult to judge,
for one of two technical reasons. A: the “loca-
tion” name was itself erroneous, in which case these
redundancy-based approaches are not meaningful;
or, B: the suggestion replaces an area by its sub-
region or super-region, e.g., changing “Connecticut”
to “USA”, or “Europe” to “France.”11
Although it is not strictly meaningful to judge
whether these changes constitute a gain or a loss,
we nonetheless tried to assess whether changing the
state hurt the accuracy of the incident, since the in-
cident may have a correct state even though its loca-
tion is erroneous (case A); likewise, it may be cor-
rect to say that a given location is indeed a part of
Connecticut, in which case changing it to USA loses
information, and is a kind of loss.
That is the interpretation of the grey gain and loss
instances. The final row, no loss, indicates the pro-
portion of cases where an originally incorrect state
name was changed to a new one, also incorrect.
11Note, that for some locations, which are not within any one
state’s boundary, a continent is a “correct state”, for example,
“the Amazon Region,” or “Serengeti.”
Records     Baseline           DB-filtered        Confidence         Multi-candidate
Changed     13.0%  858/6583     8.7%  577/6583     9.7%  642/6583    16.2%  1072/6583
Verified    27.7%  238/858     38.6%  223/577     37.1%  238/642     26.5%   284/1072
Gain        68.5%  163/238     71.3%  159/223     80.3%  191/238     76.4%   217/284
Loss         6.3%   15/238      2.2%    5/223      1.3%    3/238      3.5%    10/284
Grey gain   10.9%   26/238     12.6%   28/223     11.8%   28/238     13.7%    39/284
Grey loss    6.7%   16/238      5.4%   12/223      0.0%    0/238      2.1%     6/284
No loss      7.6%   18/238      8.5%   19/223      6.7%   16/238      4.2%    12/284
Table 1: Performance of Correction Methods
4.2 Database Filtering
Next we examined a variant of baseline raw major-
ity vote, noting that simply choosing the state most
frequently associated with a location name is a bit
naive: the location–state relation is not functional—
i.e., some location names map to more than one state
in reality. There are many locations which share the
same name.12
To approach this more intelligently, we define:
$$S_F(I) = S_B(I) \cap \mathit{states\_in\_story}(I)$$
$$\mathit{score}_F(s', I) = \mathit{score}_B(s', I)$$
Here the subscripts $B$ and $F$ denote the baseline and the
DB-filtered method, respectively.
The baseline vote counting across the data base (DB)
produced a ranked list of candidate states $s'$ for the
location $l \in I$. We then filtered this list through
$\mathit{states\_in\_story}(I)$, the list of states mentioned in
the story containing the incident $I$. The filtered majority
winner was selected as the suggested change.
For example, the name “Athens” may refer to the
city in Greece, or to the city in Georgia (USA).
Suppose that Greece is the raw majority winner.
The baseline method will always tag all instances
of Athens as being in Greece. However, in a story
about Georgia, Greece will likely not be mentioned
at all, so it is safe to rule it out. This helps a minority
winner, when the majority is not present in the story.
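The filtering step can be sketched as follows. The function name `filtered_majority` and the toy counts in the usage note are hypothetical; treating a tie among the filtered candidates as "no suggestion" is an assumption carried over from the baseline.

```python
from collections import Counter
from typing import Iterable, Optional


def filtered_majority(db_counts: Counter, states_in_story: Iterable[str]) -> Optional[str]:
    """Pick the database-wide majority state among those mentioned in the story.

    db_counts holds, for one location name, how often each state co-occurs with it.
    """
    allowed = set(states_in_story)
    kept = Counter({s: c for s, c in db_counts.items() if s in allowed})
    ranked = kept.most_common(2)
    if not ranked:
        return None                      # no candidate state is mentioned in the story
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None                      # assumed: a tie yields no suggestion
    return ranked[0][0]
```

With toy counts of 40 for "Greece" and 5 for "Georgia", a story that mentions only "Georgia" and "Alabama" yields "Georgia", which is the minority-winner case of the Athens example.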
Surprisingly, this method did not yield a substan-
tial improvement over the baseline, (though it was
more careful by changing fewer records). This may
indicate that NWP is not an important source of er-
rors here: though many truly ambiguous locations
do exist, they do not account for many instances in
this DB.
12We refer to this as the “New-World phenomenon” (NWP),
due to its prevalence in the Americas: “Santa Cruz” occurs in
several Latin American countries; locations named after saints
are common. In the USA, city and county names often appear
in multiple states (Winnebago County, Springfield); many cities
are named after older European cities.
4.3 Confidence-Based Ranking
A clearer improvement over the baseline is obtained
by taking the local confidence of the state–location
association into account. For each record,
we extend the IE analysis to produce a confidence
value for the state. Confidence is computed by sim-
ple, document-local heuristics, as follows:
If the location and state are both within the span
of text covered by the incident (text which was actually
matched by a rule in the IE system), or if the
state is the unique state mentioned in the story, it gets
a score of 2: the incident has high confidence in the
state. Otherwise, if the state is outside the incident’s
span, but is inside the same sentence as the incident,
and is also the unique state mentioned in that sentence,
it gets a score of 1. Otherwise it receives a
score of zero.
Given the confidence score for each (location $l$,
state $s$) pair, the majority counting is based on the
cumulative confidence, $\mathit{conf}_{\mathit{state}}(l, s)$, in the DB,
rather than on the cumulative count of occurrences
of this pair in the DB:
$$S_C(I) = S_F(I)$$
$$\mathit{score}_C(s', I) = \sum_{I' \in D,\ (l, s') \in I'} \mathit{conf}_{\mathit{state}}(I')$$
where $D$ is the set of all incidents in the database, $l$ is the
location fill of $I$, and the subscripts $F$ and $C$ denote the
DB-filtered and the confidence-based method, respectively.
Filtering through the story is also applied, as in
the previous method. The resulting method favors
more correct decisions, and fewer erroneous ones.
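The confidence heuristics and the cumulative score can be sketched as below. The boolean inputs stand in for tests that the IE system would perform on the text; the function names, parameter names, and toy records are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def state_confidence(both_in_incident_span: bool, unique_in_story: bool,
                     in_same_sentence: bool, unique_in_sentence: bool) -> int:
    """Document-local confidence (0, 1 or 2) of the state fill of one record."""
    if both_in_incident_span or unique_in_story:
        return 2
    if in_same_sentence and unique_in_sentence:
        return 1
    return 0


def cumulative_confidence(
        records: Iterable[Tuple[str, str, int]]) -> Dict[Tuple[str, str], int]:
    """Sum the confidence over all records sharing a (location, state) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for location, state, conf in records:
        totals[(location, state)] += conf
    return dict(totals)
```

In a toy case with three zero-confidence records for (Athens, Greece) and two records of confidence 2 for (Athens, Georgia), the cumulative scores are 0 and 4, so the confident minority outranks the raw majority.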
We should note here, that the notion of confidence
of a fill (here, the state fill) is naturally extended to
the notion of confidence of a record: For each of
the three key fills—location, date, disease name—
compute a confidence based on the same heuristics.
Then we say that a record $I$ has high confidence, if it
has non-zero confidence in all three of the key fills.
The notion of record confidence is used in Section 6.
4.4 Multi-Candidate Propagation
Finally, we tried propagating multiple candidate
state hypotheses for each instance of an ambiguous
location name $l$:
$$S_P(I) = \bigcup_{I' \in D,\ l \in I'} \mathit{states\_in\_story}(I')$$
$$\mathit{score}_P(s', I) = \sum_{I' \in D,\ l \in I'} \mathit{prox}(s', I')$$
where the subscript $P$ denotes the multi-candidate propagation
method, $D$ is the set of all incidents in the database, and
the proximity is inversely proportional to the
distance of $s'$ from incident $I'$, in the story of $I'$:
$$\mathit{prox}(s, I) = \begin{cases} \dfrac{1}{Z(I)} \cdot \dfrac{1}{d(s, I) + 1} & \text{if } s \in I \\ 0 & \text{otherwise} \end{cases}$$
For an incident $I$ mentioning location $l$, the IE system
outputs the list of all states $\{s\}$ mentioned in
the same story; we then rank each $s$ according to
the inverse of distance $d$: the number of sentences
between $I$ and $s$. $Z(I)$ is a normalization factor.
The proximity for each pair $(l, s)$ is between 0
and 1. Rather than giving a full point to a single,
locally-best guess among the $s$’s, this point is shared
proportionately among all competing $s$’s. For example,
if states $s_0, s_1, s_5$ are in the same sentence as
$I$, one, and five sentences away, respectively, then
$Z(I) = 1 + \frac{1}{2} + \frac{1}{6} = \frac{5}{3}$, and
$$\mathit{prox}(s_0) = 1 \cdot \tfrac{3}{5} = \tfrac{3}{5}, \quad \mathit{prox}(s_1) = \tfrac{1}{2} \cdot \tfrac{3}{5} = \tfrac{3}{10}, \quad \mathit{prox}(s_5) = \tfrac{1}{6} \cdot \tfrac{3}{5} = \tfrac{1}{10}.$$
The score for each state $s$ for the given $l$ is then
the sum of proximities of $s$ to $l$ across all stories.
The resulting performance is substantially bet-
ter than the baseline, while the number of changed
records is substantially higher than in the competing
methods. This is due to the fact that this method al-
lows for a much larger pool of candidates than the
others, and assigns to them much smoother weights,
virtually eliminating ties in the ranking among hy-
potheses.
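The proximity weighting of Section 4.4 can be checked with exact fractions. In the sketch below, each incident is represented by a mapping from candidate state to its distance in sentences; the function names `proximities` and `multi_candidate_scores` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict, Iterable


def proximities(distances: Dict[str, int]) -> Dict[str, Fraction]:
    """Share one point among candidate states, weighted by 1 / (distance + 1)."""
    weights = {s: Fraction(1, d + 1) for s, d in distances.items()}
    z = sum(weights.values())                 # normalization factor
    return {s: w / z for s, w in weights.items()}


def multi_candidate_scores(incidents: Iterable[Dict[str, int]]) -> Dict[str, Fraction]:
    """Sum the proximities of each state over all incidents that mention the location."""
    totals: Dict[str, Fraction] = defaultdict(Fraction)
    for distances in incidents:
        for state, p in proximities(distances).items():
            totals[state] += p
    return dict(totals)
```

Calling `proximities({"s0": 0, "s1": 1, "s5": 5})` gives 3/5, 3/10, and 1/10, which matches the worked example in the text (states in the same sentence, one sentence away, and five sentences away). Because the weights are fractional, exact ties between candidates become rare.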
5 Discussion
Among the four competing approaches presented
above, the baseline performs surprisingly well. We
should note that this research is not aimed specifi-
cally at improving geographic Named Entity resolu-
tion. It is the first in a series of experiments aiming
to leverage redundancy across a large fact base ex-
tracted from text, to improve the quality of extracted
data. We chose to experiment with this relation first
because of its simplicity, and because the state field
is a key field in our application.
For this reason, the a priori geographic knowl-
edge base was intentionally not as extensive as it
might have been, had we tried in earnest to match
locations with corresponding states (e.g., by incorporating
the CIA Factbook, or another gazetteer).
The intent here is to investigate how a relation
can be improved by leveraging redundancy across
a large body of records. The support we used for ge-
ographic name resolution was therefore deliberately
modest, cf. Section 3.3.
It is quite feasible to enumerate the countries and
the larger regions, since they number in the low hun-
dreds, whereas there are many tens of thousands of
cities, towns, villages, regions, districts, etc.
6 Current Work
Three parallel lines of current research are:
1. combining evidence from multiple features
2. applying redundancy-based correction to other
fields in the database
3. back-propagation of corrected results, to repair
components that induced incorrect information.
The results so far presented show that even a
naive, intuitive approach can help correct local er-
rors via global analysis. We are currently working
on more complex extensions of these methods.
Each method exploits one main feature of the un-
derlying data: the distance from candidate state to
the mention of the location name. In the multi-
candidate hypothesis method, this distance is ex-
ploited explicitly in the scoring function. In the
other methods, it is used inside the co-reference
module of the IE pipeline, to find the (single)
locally-best state.
However, other textual features of the state can-
didate should contribute to establishing the relations
to a location mention, besides the raw distance. For
example, at a given distance, it is very important
whether the state is mentioned before the location
(more likely to be related) vs. after the location (less
likely). Another important feature: is the state men-
tioned in the main story/report headline? If so, its
score should be raised. It is quite common for documents
to name the focal state only once in the
headline, and never mention it again, instead men-
tioning other states, neighboring, or otherwise rele-
vant to the story. The distance measure used alone
may be insufficient in such cases.
How are these features to be combined? One path
is to use some combination of features, such as a
weighted sum, with parameters trained on a man-
ually tagged data set. As we already have a rea-
sonably sized set tagged for evaluation, we can split
it into two, train the parameters on a larger portion,
evaluate on a smaller one, and cross-validate.
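One possible form of such a weighted sum is sketched below. The three features follow the examples discussed above (distance, order of mention, presence in the headline), but the class name, the parameter names, and the particular functional form are assumptions; the weights are free parameters that would have to be trained.

```python
from dataclasses import dataclass


@dataclass
class CandidateFeatures:
    sentence_distance: int    # sentences between the state and the location mention
    precedes_location: bool   # the state is mentioned before the location
    in_headline: bool         # the state appears in the main headline


def combined_score(f: CandidateFeatures, w_dist: float, w_before: float,
                   w_headline: float) -> float:
    """Weighted sum of three textual features of a candidate state."""
    return (w_dist / (f.sentence_distance + 1)
            + w_before * float(f.precedes_location)
            + w_headline * float(f.in_headline))
```

With a large enough headline weight, a state named only in the headline can outrank a nearer state mentioned in passing, which is the situation raw distance handles poorly.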
We will be using this approach as a baseline.
However, we aim to use a much larger set of data to
train the parameters, without manually tagging large
training sets.
The idea is to treat the set of incidents with high
record confidence (Sec. 4.3), rather than manually
tagged data, as ground truth. Again, this “confident”
truth will not be completely error-free, but
because error rates are lower among the confident
records, we may be able to leverage global analy-
sis to produce the desired effect: training parame-
ters for more complex models—involving multiple
features—for global re-ranking of decisions.
Conclusion
Our approach rests on the idea that evidence aggre-
gated across documents should help resolve difficult
problems at the level of a given document.
Our experiments confirm that aggregating global
information about related facts, and propagating lo-
cally non-best analyses through the pipeline, provide
powerful sources of additional evidence, which are
able to reverse incorrect decisions, based only on lo-
cal and a priori information.
The proposed approach requires no supervision or
training of any kind. It does, however, require a substantial
collection of evidence across a large body
of extracted records; this approach needs a “critical
mass” of data to be effective. Although a large volume
of facts is usually not reported in classic IE experiments,
obtaining high volume should be natural in
principle.
Acknowledgement
We’d like to thank Winston Lin of New York Uni-
versity, who worked on an early version of these ex-
periments.
References
ACE. 2004. Automatic content extraction.
A. Douthat. 1998. The Message Understanding Con-
ference scoring software user’s manual. In Proc. 7th
Message Understanding Conf. (MUC-7), Fairfax, VA.
R. Grishman, S. Huttunen, and R. Yangarber. 2003. In-
formation extraction for enhanced access to disease
outbreak reports. J. of Biomed. Informatics, 35(4).
S. Huttunen, R. Yangarber, and R. Grishman. 2002.
Complexity of event structure in information extrac-
tion. In Proc. 19th Intl. Conf. Computational Linguis-
tics (COLING 2002), Taipei.
H. Ji and R. Grishman. 2004. Applying coreference to
improve name recognition. In Proc. Reference Reso-
lution Wkshop, (ACL-2004), Barcelona, Spain.
H. Ji and R. Grishman. 2005. Improving name tagging
by reference resolution and relation detection. In Proc.
ACL-2005, Amherst, Mass.
A. McCallum and D. Jensen. 2003. A note on the uni-
fication of information extraction and data mining us-
ing conditional-probability, relational models. In IJ-
CAI’03 Workshop on Learning Statistical Models from
Relational Data.
MeSH. 2004. Medical subject headings.
U. Y. Nahm and R. Mooney. 2000a. A mutually benefi-
cial integration of data mining and information extrac-
tion. In AAAI-2000, Austin, TX.
U. Y. Nahm and R. Mooney. 2000b. Using information
extraction to aid the discovery of prediction rules from
text. In KDD-2000 Text Mining Wkshop, Boston, MA.
R. Yangarber, L. Jokipii, A. Rauramo, and S. Hut-
tunen. 2005. Extracting information about outbreaks
of infectious epidemics. In Proc. HLT-EMNLP 2005
Demonstrations, Vancouver, Canada.
