A Hybrid Approach for the Acquisition of
Information Extraction Patterns
Mihai Surdeanu, Jordi Turmo, and Alicia Ageno
Technical University of Catalunya
Barcelona, Spain
{surdeanu,turmo,ageno}@lsi.upc.edu
Abstract
In this paper we present a hybrid ap-
proach for the acquisition of syntactico-
semantic patterns from raw text. Our
approach co-trains a decision list learner
whose feature space covers the set of all
syntactico-semantic patterns with an Ex-
pectation Maximization clustering algo-
rithm that uses the text words as attributes.
We show that the combination of the two
methods always outperforms the decision
list learner alone. Furthermore, using a
modular architecture we investigate sev-
eral algorithms for pattern ranking, the
most important component of the decision
list learner.
1 Introduction
Traditionally, Information Extraction (IE) identi-
fies domain-specific events, entities, and relations
among entities and/or events with the goals of:
populating relational databases, providing event-
level indexing in news stories, feeding link discov-
ery applications, etcetera.
By and large the identification and selective ex-
traction of relevant information is built around a
set of domain-specific linguistic patterns. For ex-
ample, for a “financial market change” domain
one relevant pattern is <NOUN fall MONEY
to MONEY>. When this pattern is matched on
the text “London gold fell $4.70 to $308.35”, a
change of $4.70 is detected for the financial in-
strument “London gold”.
Domain-specific patterns are either hand-
crafted or acquired automatically (Riloff, 1996;
Yangarber et al., 2000; Yangarber, 2003; Steven-
son and Greenwood, 2005). To minimize annota-
tioncosts,someofthelatterapproachesuselightly
supervised bootstrapping algorithms that require
as input only a small set of documents annotated
with their correspondingcategory label. The focus
of this paper is to improve such lightly supervised
pattern acquisition methods. Moreover, we focus
on robust bootstrapping algorithms that can han-
dle real-world document collections, which con-
tain many domains.
Although a rich literature covers bootstrap-
ping methods applied to natural language prob-
lems (Yarowsky, 1995; Riloff, 1996; Collins and
Singer, 1999; Yangarber et al., 2000; Yangar-
ber, 2003; Abney, 2004) several questions remain
unanswered when these methods are applied to
syntactic or semantic pattern acquisition. In this
paper we answer two of these questions:
(1) Can pattern acquisition be improved with
text categorization techniques?
Bootstrapping-based pattern acquisition algo-
rithms can also be regarded as incremental text
categorization (TC), since in each iteration docu-
ments containing certain patterns are assigned the
corresponding category label. Although TC is ob-
viously not the main goal of pattern acquisition
methodologies,itisneverthelessanintegralpartof
the learning algorithm: each iteration of the acqui-
sition algorithm depends on the previous assign-
ments of category labels to documents. Hence, if
the quality of the TC solution proposed is bad, the
quality of the acquired patterns will suffer.
Motivated by this observation, we introduce a
co-training-based algorithm (Blum and Mitchell,
1998) that uses a text categorization algorithm as
reinforcement for pattern acquisition. We show,
using both a direct and an indirect evaluation, that
the combination of the two methodologies always
improves the quality of the acquired patterns.
48
(2) Which pattern selection strategy is best?
Whilemostbootstrapping-basedalgorithmsfol-
low the same framework, they vary significantly
in what they consider the most relevant patterns in
each bootstrapping iteration. Several approaches
have been proposed in the context of word sense
disambiguation (Yarowsky, 1995), named entity
(NE) classification (Collins and Singer, 1999),
patternacquisitionforIE(Riloff,1996; Yangarber,
2003), or dimensionality reduction for text catego-
rization (TC) (Yang and Pedersen, 1997). How-
ever, it is not clear which selection approach is
the best for the acquisition of syntactico-semantic
patterns. To answer this question, we have im-
plemented a modular pattern acquisition architec-
ture where several of these ranking strategies are
implemented and evaluated. The empirical study
presented in this paper shows that a strategy previ-
ouslyproposedforfeaturerankingforNErecogni-
tion outperforms algorithms designed specifically
for pattern acquisition.
The paper is organized as follows: Sec-
tion 2 introduces the bootstrapping framework
used throughout the paper. Section 3 introduces
the data collections. Section 4 describes the di-
rect and indirect evaluation procedures. Section 5
introduces a detailed empirical evaluation of the
proposed system. Section 6 concludes the paper.
2 The Pattern Acquisition Framework
In this section we introduce a modular pattern ac-
quisition framework that co-trains two different
views of the document collection: the first view
uses the collection words to train a text categoriza-
tion algorithm, while the second view bootstraps
a decision list learner that uses all syntactico-
semantic patterns as features. The rules acquired
by the latter algorithm, of the form p → y, where
p is a pattern and y is a domain label, are the out-
put of the overall system. The system can be cus-
tomized with several pattern selection strategies
that dramatically influence the quality and order
of the acquired rules.
2.1 Co-training Text Categorization and
Pattern Acquisition
Given two views of a classification task, co-
training (Blum and Mitchell, 1998) bootstraps a
separate classifier for each view as follows: (1)
it initializes both classifiers with the same small
amount of labeled data (i.e. seed documents in our
case); (2) it repeatedly trains both classifiers us-
ing the currently labeled data; and (3) after each
learning iteration, the two classifiers share all or a
subset of the newly labeled examples (documents
in our particular case).
The intuition is that each classifier provides
new, informative labeled data to the other classi-
fier. If the two views are conditional independent
and the two classifiers generally agree on unla-
beled data they will have low generalization error.
In this paper we focus on a “naive” co-training ap-
proach, which trains a different classifier in each
iteration and feeds its newly labeled examples to
the other classifier. This approach was shown to
perform well on real-world natural language prob-
lems (Collins and Singer, 1999).
Figure 1 illustrates the co-training framework
used in this paper. The feature space of the
first view contains only lexical information, i.e.
the collection words, and uses as classifier Ex-
pectation Maximization (EM) (Dempster et al.,
1977). EM is actually a class of iterative algo-
rithms that find maximum likelihood estimates of
parametersusingprobabilisticmodelsoverincom-
plete data (e.g. both labeled and unlabeled docu-
ments) (Dempster et al., 1977). EM was theoret-
ically proven to converge to a local maximum of
the parameters’ log likelihood. Furthermore, em-
pirical experiments showed that EM has excellent
performance for lightly-supervised text classifica-
tion (Nigam et al., 2000). The EM algorithm used
in this paper estimates its model parameters us-
ing the Naive Bayes (NB) assumptions, similarly
to (Nigam et al., 2000). From this point further,
we refer to this instance of the EM algorithm as
NB-EM.
The feature space of the second view contains
the syntactico-semantic patterns, generated using
the procedure detailed in Section 3.2. The second
learner is the actual pattern acquisition algorithm
implemented as a bootstrapped decision list clas-
sifier.
The co-training algorithm introduced in this pa-
per interleaves one iteration of the NB-EM algo-
rithm with one iteration of the pattern acquisition
algorithm. If one classifier converges faster (e.g.
NB-EM typically converges in under 20 iterations,
whereas the acquisition algorithms learns new pat-
terns for hundreds of iterations) we continue boot-
strapping the other classifier alone.
2.2 The Text Categorization Algorithm
The parameters of the generative NB model, ˆθ, in-
clude the probability of seeing a given category,
49
pattern
Initialize
acquisition
Labeled seed documents
Unlabeled documents Iteration
NB−EM Patternacquisition
iteration
Pattern
acquisition
terminated?
NB−EM
converged?
Ranking
method
Initialize
NB−EM
No
Yes
Patterns
Yes
No
Figure 1: Co-training framework for pattern acquisition.
1. Initialization:
• Initialize the set of labeled examples with n la-
beled seed documents of the form (di, yi). yi is
the label of the ith document di. Each docu-
ment di contains a set of patterns {pi1,pi2,...,pim}.
• Initialize the list of learned rules R = {}.
2. Loop:
• For each label y, select a small set of pattern
rules r = p → y, r /∈ R.
• Append all selected rules r to R.
• For all non-seed documents d that contain a
pattern in R, set label(d) = argmaxp,y strength(p,y).
3. Termination condition:
• Stop if no rules selected or maximum number
of iterations reached.
Figure 2: Pattern acquisition meta algorithm
P(c|ˆθ), and the probability of seeing a word given
a category, P(w|c; ˆθ). We calculate both simi-
larly to Nigam (2000). Using these parameters,
the word independence assumption typical to the
Naive Bayes model, and the Bayes rule, the prob-
ability that a document d has a given category c is
calculated as:
P(c|d; ˆθ) = P(c|
ˆθ)P(d|c; ˆθ)
P(d|ˆθ) (1)
= P(c|
ˆθ)Π|d|i=1P(wi|c; ˆθ)
summationtextq
j=1 P(cj|
ˆθ)Π|d|i=1P(wi|cj; ˆθ) (2)
2.3 The Pattern Acquisition Algorithm
The lightly-supervised pattern acquisition algo-
rithm iteratively learns domain-specific IE pat-
terns from a small set of labeled documents and
a much larger set of unlabeled documents. Dur-
ing each learning iteration, the algorithm acquires
a new set of patterns and labels more documents
based on the new evidence. The algorithm output
is a list R of rules p → y, where p is a pattern
in the set of patterns P, and y a category label in
Y = {1...k}, k being the number of categories in
the document collection. The list of acquired rules
R is sorted in descending order of rule importance
to guarantee that the most relevant rules are ac-
cessed first. This generic bootstrapping algorithm
is formalized in Figure 2.
Previous studies called the class of algorithms
illustrated in Figure 2 “cautious” or “sequential”
because in each iteration they acquire 1 or a small
set of rules (Abney, 2004; Collins and Singer,
1999). This strategy stops the algorithm from be-
ing over-confident, an important restriction for an
algorithm that learns from large amounts of unla-
beled data. This approach was empirically shown
to perform better than a method that in each itera-
tionacquiresall rulesthatmatchacertaincriterion
(e.g. the corresponding rule has a strength over a
certain threshold).
The key element where most instances of this
algorithm vary is the select procedure, which de-
cides which rules are acquired in each iteration.
Although several selection strategies have been
previously proposed for various NLP problems, to
our knowledge no existing study performs an em-
pirical analysis of such strategies in the context of
acquisition of IE patterns. For this reason, we im-
plement several selection methods in our system
(described in Section 2.4) and evaluate their per-
formance in Section 5.
The label of each collection document is given
bythestrengthofitspatterns. Similarlyto(Collins
and Singer, 1999; Yarowsky, 1995), we define the
strength of a pattern p in a category y as the pre-
cision of p in the set of documents labeled with
category y, estimated using Laplace smoothing:
strength(p,y) = count(p,y) + epsilon1count(p) + kepsilon1 (3)
where count(p,y) is the number of documents la-
beled y containing pattern p, count(p) is the over-
all number of labeled documents containing p, and
k is the number of domains. For all experiments
presented here we used epsilon1 = 1.
Another point where acquisition algorithms dif-
feristheinitializationprocedure: somestartwitha
small number of hand-labeled documents (Riloff,
1996), as illustrated in Figure 2, while others start
with a set of seed rules (Yangarber et al., 2000;
Yangarber, 2003). However, these approaches are
conceptually similar: the seed rules are simply
used to generate the seed documents.
Thispaperfocusesontheframeworkintroduced
in Figure 2 for two reasons: (a) “cautious” al-
50
gorithms were shown to perform best for several
NLP problems (including acquisition of IE pat-
terns), and (b) it has nice theoretical properties:
Abney (2004) showed that, regardless of the selec-
tion procedure, “sequential” bootstrapping algo-
rithms converge to a local minimum of K, where
K is an upper bound of the negative log likelihood
of the data. Obviously, the quality of the local
minimum discovered is highly dependent of the
selection procedure, which is why we believe an
evaluation of several pattern selection strategies is
important.
2.4 Selection Criteria
The pattern selection component, i.e. the select
procedure of the algorithm in Figure 2, consists of
the following: (a) for each category y all patterns
p are sorted in descending order of their scores in
the current category, score(p,y), and (b) for each
category the top k patterns are selected. For all
experiments in this paper we have used k = 3.
We provide four different implementations for the
pattern scoring function score(p,y) according to
four different selection criteria.
Criterion 1: Riloff
This selection criterion was developed specifically
for the pattern acquisition task (Riloff, 1996) and
has been used in several other pattern acquisition
systems (Yangarber et al., 2000; Yangarber, 2003;
Stevenson and Greenwood, 2005). The intuition
behind it is that a qualitative pattern is yielded by a
compromise between pattern precision (which is a
good indicator of relevance) and pattern frequency
(which is a good indicator of coverage). Further-
more, the criterion considers only patterns that are
positively correlated with the corresponding cate-
gory, i.e. their precision is higher than 50%. The
Riloff score of a pattern p in a category y is for-
malized as:
score(p,y) =
braceleftBigg prec(p,y)log(count(p,y)),
if prec(p,y) > 0.5;
0,otherwise.
(4)
prec(p,y) = count(p,y)count(p) (5)
where prec(p,y) is the raw precision of pattern p
in the set of documents labeled with category y.
Criterion 2: Collins
This criterion was used in a lightly-supervised NE
recognizer (Collins and Singer, 1999). Unlike the
previous criterion, which combines relevance and
frequency in the same scoring function, Collins
considers only patterns whose raw precision is
over a hard threshold T and ranks them by their
global coverage:
score(p,y) =
braceleftbigg
count(p), if prec(p,y) > T;
0, otherwise. (6)
Similarly to (Collins and Singer, 1999) we used
T = 0.95 for all experiments reported here.
Criterion 3: χ2 (Chi)
The χ2 score measures the lack of independence
between a pattern p and a category y. It is com-
puted using a two-way contingency table of p and
y,whereaisthenumberoftimespandy co-occur,
b is the number of times p occurs without y, c is
the number of times y occurs without p, and d is
the number of times neither p nor y occur. The
number of documents in the collection is n. Sim-
ilarly to the first criterion, we consider only pat-
terns positively correlated with the corresponding
category:
score(p,y) =
braceleftbigg
χ2(p,y), if prec(p,y) > 0.5;
0, otherwise. (7)
χ2(p,y) = n(ad−cb)
2
(a + c)(b + d)(a + b)(c + d) (8)
The χ2 statistic was previously reported to be
the best feature selection strategy for text catego-
rization (Yang and Pedersen, 1997).
Criterion 4: Mutual Information (MI)
Mutual information is a well known information
theory criterion that measures the independence of
two variables, in our case a pattern p and a cate-
goryy (YangandPedersen, 1997). Usingthesame
contingency table introduced above, the MI crite-
rion is estimated as:
score(p,y) =
braceleftbigg
MI(p,y), if prec(p,y) > 0.5;
0, otherwise. (9)
MI(p,y) = log P(p∧y)P(p)×P(y) (10)
≈ log na(a + c)(a + b) (11)
3 The Data
3.1 The Document Collections
For all experiments reported in this paper we used
the following three document collections: (a) the
AP collection is the Associated Press (year 1999)
subset of the AQUAINT collection (LDC catalog
number LDC2002T31); (b) the LATIMES collec-
tion is the Los Angeles Times subset of the TREC-
5 collection1; and (c) the REUTERS collection is
the by now classic Reuters-21578 text categoriza-
tion collection2.
1http://trec.nist.gov/data/docs eng.html
2http://trec.nist.gov/data/reuters/reuters.html
51
Collection # of docs # of categories # of words # of patterns
AP 5000 7 24812 140852
LATIMES 5000 8 29659 69429
REUTERS 9035 10 12905 36608
Table 1: Document collections used in the evaluation
Similarly to previous work, for the REUTERS
collection we used the ModApte split and selected
the ten most frequent categories (Nigam et al.,
2000). Due to memory limitations on our test ma-
chines, we reduced the size of the AP and LA-
TIMES collections to their first 5,000 documents
(the complete collections contain over 100,000
documents).
The collection words were pre-processed as fol-
lows: (i) stop words and numbers were discarded;
(ii) all words were converted to lower case; and
(iii) terms that appear in a single document were
removed. Table 1 lists the collection characteris-
tics after pre-processing.
3.2 Pattern Generation
In order to extract the set of patterns available in
a document, each collection document undergoes
the following processing steps: (a) we recognize
and classify named entities3, and (b) we generate
full parse trees of all document sentences using a
probabilistic context-free parser.
Following the above processing steps, we ex-
tractSubject-Verb-Object(SVO)tuplesusingase-
ries of heuristics, e.g.: (a) nouns preceding active
verbs are subjects, (b) nouns directly attached to a
verb phrase are objects, (c) nouns attached to the
verb phrase through a prepositional attachment are
indirect objects. Each tuple element is replaced
with either its head word, if its head word is not
included in a NE, or with the NE category oth-
erwise. For indirect objects we additionally store
the accompanying preposition. Lastly, each tuple
containing more than two elements is generalized
by maintaining only subsets of two and three of its
elements and replacing the others with a wildcard.
Table 2 lists the patterns extracted from one
sample sentence. As Table 2 hints, the system
generates a large number of candidate patterns. It
is the task of the pattern acquisition algorithm to
extract only the relevant ones from this complex
search space.
4 The Evaluation Procedures
4.1 The Indirect Evaluation Procedure
The goal of our evaluation procedure is to measure
the quality of the acquired patterns. Intuitively,
3We identify six categories: persons, locations, organiza-
tions, other names, temporal and numerical expressions.
Text The Minnesota Vikings beat the Arizona
Cardinals in yesterday’s game.
Patterns s(ORG) v(beat)
v(beat) o(ORG)
s(ORG) o(ORG)
v(beat) io(in game)
s(ORG) io(in game)
o(ORG) io(in game)
s(ORG) v(beat) o(ORG)
s(ORG) v(beat) io(in game)
v(beat) o(ORG) io(in game)
Table 2: Patterns extracted from one sample sentence. s
stands for subject, v for verb, o for object, and io for indirect
object.
thelearnedpatternsshouldhavehighcoverageand
low ambiguity. We indirectly measure the quality
of the acquired patterns using a text categorization
strategy: we feed the acquired rules to a decision-
list classifier, which is then used to classify a new
set of documents. The classifier assigns to each
document the category label given by the first rule
whose pattern matches. Since we expect higher-
quality patterns to appear higher in the rule list,
the decision-list classifier never changes the cate-
gory of an already-labeled document.
The quality of the generated classification is
measured using micro-averaged precision and re-
call:
P =
summationtextq
i=1 TruePositivesisummationtextq
i=1(TruePositivesi + FalsePositivesi)
(12)
R =
summationtextq
i=1 TruePositivesisummationtextq
i=1(TruePositivesi + FalseNegativesi)
(13)
where q is the number of categories in the docu-
ment collection.
For all experiments and all collections with the
exception of REUTERS, which has a standard
document split for training and testing, we used 5-
fold cross validation: we randomly partitioned the
collections into 5 sets of equal sizes, and reserved
a different one for testing in each fold.
Wehavechosenthisevaluationstrategybecause
this indirect approach was shown to correlate well
withadirectevaluation,wherethelearnedpatterns
were used to customize an IE system (Yangarber
et al., 2000). For this reason, much of the fol-
lowing work on pattern acquisition has used this
approach as a de facto evaluation standard (Yan-
garber, 2003; Stevenson and Greenwood, 2005).
Furthermore, given the high number of domains
and patterns (we evaluate on 25 domains), an eval-
uation by human experts is extremely costly. Nev-
ertheless, to show that the proposed indirect eval-
uation correlates well with a direct evaluation, two
human experts have evaluated the patterns in sev-
eral domains. The direct evaluation procedure is
described next.
52
4.2 The Direct Evaluation Procedure
The task of manually deciding whether an ac-
quired pattern is relevant or not for a given domain
is not trivial, mainly due to the ambiguity of the
patterns. Thus, this process should be carried out
by more than one expert, so that the relevance of
the ambiguous patterns can be agreed upon. For
example, the patterns s(ORG) v(score) o(goal) and
s(PER) v(lead) io(with point) are clearly relevant
only for the sports domain, whereas the patterns
v(sign) io(as agent) and o(title) io(in DATE) might
be regarded as relevant for other domains as well.
The specific procedure to manually evaluate the
patterns is the following: (1) two experts sepa-
rately evaluate the acquired patterns for the con-
sidered domains and collections; and (2) the re-
sults of both evaluations are compared. For any
disagreement, we have opted for a strict evalua-
tion: all the occurrences of the corresponding pat-
tern are looked up in the collection and, whenever
at least one pattern occurrence belongs to a docu-
ment assigned to a different domain than the do-
main in question, the pattern will be considered as
not relevant.
Both the ambiguity and the high number of
the extracted patterns have prevented us from per-
forming an exhaustive direct evaluation. For this
reason, only the top (most relevant) 100 patterns
havebeenevaluatedforonedomainpercollection.
The results are detailed in Section 5.2.
5 Experimental Evaluation
5.1 Indirect Evaluation
For a better understanding of the proposed ap-
proach we perform an incremental evaluation:
first, we evaluate only the various pattern selection
criteria described in Section 2.4 by disabling the
NB-EM component. Second, using the best selec-
tion criteria, we evaluate the complete co-training
system.
In both experiments we initialize the system
with high-precision manually-selected seed rules
which yield seed documents with a coverage of
10% ofthe trainingpartitions. The remaining 90%
of the training documents are maintained unla-
beled. For all experiments we used a maximum of
400 bootstrapping iterations. The acquired rules
are fed to the decision list classifier which assigns
category labels to the documents in the test parti-
tions.
Evaluation of the pattern selection criteria
Figure 3 illustrates the precision/recall charts
of the four algorithms as the number of patterns
made available to the decision list classifier in-
creases. All charts show precision/recall points
starting after 100 learning iterations with 100-
iteration increments. It is immediately obvious
that the Collins selection criterion performs sig-
nificantly better than the other three criteria. For
the same recall point, Collins yields a classifica-
tion model with much higher precision, with dif-
ferences ranging from 5% in the REUTERS col-
lection to 20% in the AP collection.
Theorem 5 in (Abney, 2002) provides a theo-
retical explanation for these results: if certain in-
dependence conditions between the classifier rules
are satisfied and the precision of each rule is larger
than a threshold T, then the precision of the final
classifier is larger than T. Although the rule inde-
pendence conditions are certainly not satisfied in
our real-world evaluation, the above theorem in-
dicates that there is a strong relation between the
precision of the classifier rules on labeled data and
theprecisionofthefinalclassifier. Ourresultspro-
vide the empirical proof that controling the preci-
sion of the acquired rules (i.e. the Collins crite-
rion) is important.
The Collins criterion controls the recall of the
learned model by favoring rules with high fre-
quency in the collection. However, since the other
two criteria do not use a high precision thresh-
old, they will acquire more rules, which translates
in better recall. For two out of the three collec-
tions, Riloff and Chi obtain a slightly better recall,
about 2% higher than Collins’, albeit with a much
lower precision. We do not consider this an im-
portant advantage: in the next section we show
that co-training with the NB-EM component fur-
ther boosts the precision and recall of the Collins-
based acquisition algorithm.
The MI criterion performs the worst of the four
evaluated criteria. A clue for this behavior lies in
the following equivalent form for MI: MI(p,y) =
logP(p|y)−logP(p). Thisformulaindicatesthat,
for patterns with equal conditional probabilities
P(p|y), MI assigns higher scores to patterns with
lower frequency. This is not the desired behavior
in a TC-oriented system.
Evaluation of the co-training system
Figure 4 compares the performance of the
stand-alone pattern acquisition algorithm (“boot-
strapping”) with the performance of the acquisi-
tion algorithm trained in the co-training environ-
53
 0.3
 0.35
 0.4
 0.45
 0.5
 0.55
 0.6
 0.65
 0.7
 0.75
 0.8
 0.85
 0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5  0.55
Precision
Recall
collins
riloff
chi
mi
(a)
 0.25
 0.3
 0.35
 0.4
 0.45
 0.5
 0.55
 0.6
 0.65
 0.7
 0.75
 0.1  0.15  0.2  0.25  0.3  0.35  0.4
Precision
Recall
collins
riloff
chi
mi
(b)
 0.65
 0.7
 0.75
 0.8
 0.85
 0.9
 0.95
 0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45
Precision
Recall
collins
riloff
chi
mi
(c)Figure 3:
Performance of the pattern acquisition algorithm for various pattern selection strategies and multiple collections:
(a) AP, (b) LATIMES, and (c) REUTERS
ment (“co-training”). For both setups we used the
best pattern selection criterion for pattern acqui-
sition, i.e. the Collins criterion. To put things in
perspective, we also depict the performance ob-
tained with a baseline system, i.e. the system con-
figured to use the Riloff pattern selection criterion
and without the NB-EM algorithm (“baseline”).
To our knowledge, this system, or a variation of
it, is the current state-of-the-art in pattern acqui-
sition (Riloff, 1996; Yangarber et al., 2000; Yan-
garber, 2003; Stevenson and Greenwood, 2005).
All algorithms were initialized with the same seed
rules and had access to all documents.
Figure 4 shows that the quality of the learned
patterns always improves if the pattern acquisi-
tion algorithm is “reinforced” with EM. For the
same recall point, the patterns acquired in the
co-training environment yield classification mod-
els with precision (generally) much larger than
the models generated by the pattern acquisition
algorithm alone. When using the same pat-
tern acquisition criterion, e.g. Collins, the dif-
ferences between the co-training approach and
the stand-alone pattern acquisition method (“boot-
strapping”) range from 2-3% in the REUTERS
collection to 20% in the LATIMES collection.
These results support our intuition that the sparse
pattern space is insufficient to generate good clas-
sification models, which directly influences the
quality of all acquired patterns.
Furthermore, due to the increased coverage of
the lexicalized collection views, the patterns ac-
quired in the co-training setup generally have bet-
ter recall, up to 11% higher in the LATIMES col-
lection.
Lastly, the comparison of our best system (“co-
training”) against the current state-of-the-art (our
“baseline”) draws an even more dramatic picture:
Collection Domain Relevant Relevant Initial
patterns patterns inter-expert
baseline co-training agreement
AP Sports 22% 68% 84%
LATIMES Financial 67% 76% 70%
REUTERS Corporate 38% 46% 66%
Acquisitions
Table 3: Percentage of relevant patterns for one domain per
collection by the baseline system (Riloff) and the co-training
system.
for the same recall point, the co-training system
obtains a precision up to 35% higher for AP and
LATIMES, and up to 10% higher for REUTERS.
5.2 Direct Evaluation
As stated in Section 4.2, two experts have man-
ually evaluated the top 100 acquired patterns for
one different domain in each of the three collec-
tions. The three corresponding domains have been
selected intending to deal with different degrees of
ambiguity, which are reflected in the initial inter-
expert agreement. Any disagreement between ex-
perts is solved using the algorithm introduced in
Section 4.2. Table 3 shows the results of this di-
rect evaluation. The co-training approach outper-
forms the baseline for all three collections. Con-
cretely, improvements of 9% and 8% are achieved
for the Financial and the Corporate Acquisitions
domains, and 46%, by far the largest difference, is
found for the Sports domain in AP. Table 4 lists
the top 20 patterns extracted by both approaches
in the latter domain. It can be observed that for
the baseline, only the top 4 patterns are relevant,
the rest being extremely general patterns. On the
other hand, the quality of the patterns acquired by
our approach is much higher: all the patterns are
relevant to the domain, although 7 out of the 20
might be considered ambiguous and according to
the criterion defined in Section 4.2 have been eval-
uated as not relevant.
54
 0.45
 0.5
 0.55
 0.6
 0.65
 0.7
 0.75
 0.8
 0.85
 0.9
 0.3  0.35  0.4  0.45  0.5  0.55  0.6
Precision
Recall
co-training
bootstrapping
baseline
(a)
 0.4
 0.45
 0.5
 0.55
 0.6
 0.65
 0.7
 0.75
 0.8
 0.85
 0.2  0.25  0.3  0.35  0.4  0.45
Precision
Recall
co-training
bootstrapping
baseline
(b)
 0.7
 0.75
 0.8
 0.85
 0.9
 0.95
 0.15  0.2  0.25  0.3  0.35  0.4  0.45
Precision
Recall
co-training
bootstrapping
baseline
(c)Figure4:
Comparisonofthebootstrappingpatternacquisitionalgorithmwiththeco-trainingapproach: (a)AP,(b)LATIMES,
and (c) REUTERS
Baseline Co-training
s(he) o(game) v(win) o(title)
v(miss) o(game) s(I) v(play)
v(play) o(game) s(he) v(game)
v(play) io(in LOC) s(we) v(play)
v(go) o(be) v(miss) o(game)
s(he) v(be) s(he) v(coach)
s(that) v(be) v(lose) o(game)
s(I) v(be) s(I) o(play)
s(it) v(go) o(be) v(make) o(play)
s(it) v(be) v(play) io(in game)
s(I) v(think) v(want) o(play)
s(I) v(know) v(win) o(MISC)
s(I) v(want) s(he) o(player)
s(there) v(be) v(start) o(game)
s(we) v(do) s(PER) o(contract)
v(do) o(it) s(we) o(play)
s(it) o(be) s(team) v(win)
s(we) v(are) v(rush) io(for yard)
s(we) v(go) s(we) o(team)
s(PER) o(DATE) v(win) o(Bowl)
Table 4: Top 20 patterns acquired from the Sports domain
by the baseline system (Riloff) and the co-training system for
the AP collection. The correct patterns are in bold.
6 Conclusions
This paper introduces a hybrid, lightly-supervised
method for the acquisition of syntactico-semantic
patterns for Information Extraction. Our approach
co-trains a decision list learner whose feature
space covers the set of all syntactico-semantic
patterns with an Expectation Maximization clus-
tering algorithm that uses the text words as at-
tributes. Furthermore, we customize the decision
list learner with up to four criteria for pattern se-
lection, which is the most important component of
the acquisition algorithm.
For the evaluation of the proposed approach we
have used both an indirect evaluation based on
Text Categorization and a direct evaluation where
human experts evaluated the quality of the gener-
ated patterns. Our results indicate that co-training
the Expectation Maximization algorithm with the
decision list learner tailored to acquire only high
precision patterns is by far the best solution. For
the same recall point, the proposed method in-
creases the precision of the generated models up
to 35% from the previous state of the art. Further-
more, the combination of the two feature spaces
(words and patterns) also increases the coverage
of the acquired patterns. The direct evaluation of
the acquired patterns by the human experts vali-
dates these results.

References
S. Abney. 2002. Bootstrapping. In Proceedings of the 40th
Annual Meeting of the Association for Computational Lin-
guistics.
S. Abney. 2004. Understanding the Yarowsky algorithm.
Computational Linguistics, 30(3).
A. Blum and T. Mitchell. 1998. Combining labeled and un-
labeled data with co-training. In Proceedings of the 11th
Annual Conference on Computational Learning Theory.
M. Collins and Y. Singer. 1999. Unsupervised models for
named entity classification. In Proceedings of EMNLP.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Max-
imum likelihood from incomplete data via the EM algo-
rithm. Journal of the Royal Statistical Society, Series B,
39(1).
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. 2000.
Text classification from labeled and unlabeled documents
using EM. Machine Learning, 39(2/3).
E.Riloff. 1996. Automaticallygeneratingextractionpatterns
from untagged text. In Proceedings of the Thirteenth Na-
tional Conference on Artificial Intelligence (AAAI-96).
M. Stevenson and M. Greenwood. 2005. A semantic ap-
proach to ie pattern induction. In Proceedings of the 43rd
Meeting of the Association for Computational Linguistics.
Y. Yang and J. O. Pedersen. 1997. A comparative study
on feature selection in text categorization. In Proceed-
ings of the Fourteenth International Conference on Ma-
chine Learning.
R. Yangarber, R. Grishman, P. Tapanainen, and S. Hutunen.
2000. Automatic acquisition of domain knowledge for in-
formation extraction. In Proceedings of the 18th Interna-
tional Conference of Computational Linguistics (COLING
2000).
R. Yangarber. 2003. Counter-training in discovery of se-
mantic patterns. In Proceedings of the 41st Annual Meet-
ing of the Association for Computational Linguistics (ACL
2003).
D. Yarowsky. 1995. Unsupervised word sense disambigua-
tion rivaling supervised methods. In Proceedings of the
33rd Annual Meeting of the Association for Computa-
tional Linguistics.
