Learning Effective Surface Text Patterns
for Information Extraction
Gijs Geleijnse and Jan Korst
Philips Research Laboratories
Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
{gijs.geleijnse,jan.korst}@philips.com
Abstract
We present a novel method to identify ef-
fective surface text patterns using an inter-
net search engine. Precision is only one
of the criteria to identify the most effec-
tive patterns among the candidates found.
Another aspect is frequency of occurrence.
Also, a pattern has to relate diverse in-
stances if it expresses a non-functional re-
lation. The learned surface text patterns
are applied in an ontology population al-
gorithm, which not only learns new in-
stances of classes but also new instance-
pairs of relations. We present some first
experiments with these methods.
1 Introduction
Ravichandran and Hovy (2002) present a method
to automatically learn surface text patterns ex-
pressing relations between instances of classes us-
ing a search engine. Their method, based on
a training set, identifies natural language surface
text patterns that express some relation between
two instances. For example, “was born in” proved
to be a precise pattern expressing the relation be-
tween instances Mozart (of class ‘person’) and
1756 (of class ‘year’).
We address the issue of learning surface text
patterns, since we observed two drawbacks of
Ravichandran and Hovy’s work with respect to the
application of such patterns in a general informa-
tion extraction setting.
The first drawback is that Ravichandran and
Hovy focus on the use of such surface text patterns
to answer so-called factoid questions (Voorhees,
2004). They use the assumption that each instance
is related by R to exactly one other instance of
some class. In a general information extraction
setting, we cannot assume that all relations are
functional.
The second drawback is that the criterion for se-
lecting patterns, precision, is not the only issue for
a pattern to be effective. We call a pattern effec-
tive, if it links many different instance-pairs in the
excerpts found with a search engine.
We use an ontology to model the information
domain we are interested in. Our goal is to pop-
ulate an ontology with the information extracted.
In an ontology, instances of one class can be re-
lated by some relation R to multiple instances of
some other class. For example, we can identify
the classes ‘movie’ and ‘actor’ and the ‘acts in’-
relation, which is a many-to-many relation. In
general, multiple actors star in a single movie and
a single actor stars in multiple movies.
In this paper we present a domain-independent
method to learn effective surface text patterns rep-
resenting relations. Since not all patterns found
are highly usable, we formulate criteria to select
the most effective ones. We show how such pat-
terns can be used to populate an ontology.
The identification of effective patterns is impor-
tant, since we want to perform as few queries to
a search engine as possible to limit the use of its
services.
This paper is organized as follows. After defining
the problem (Section 2) and discussing related
work (Section 3), we present an algorithm to learn
effective surface text patterns in Section 4. We discuss
the application of this method in an ontology
population algorithm in Section 5. In Section 6,
we present some of our early experiments. Sec-
tions 7 and 8 handle conclusions and future work.
2 Problem description
We consider two classes cq and ca and the corre-
sponding non-empty sets of instances Iq and Ia.
Elements in the sets Iq and Ia are instances of cq
and ca respectively, and are known to us before-
hand. However, the sets I do not have to be com-
plete, i.e. not all possible instances of the corre-
sponding class have to be in the set I.
Moreover, we consider some rela-
tion R between these classes and give a
non-empty training set of instance-pairs
TR = {(x,y) | x ∈ Iq ∧ y ∈ Ia}, which
are instance-pairs that are known to be R-related.
Problem: Given the classes cq and ca, the sets
of instances Iq and Ia, a relation R and a set
of R-related instance-pairs TR, learn effective
surface text patterns that express the relation R.
Say, for example, we consider the classes ‘author’
and ‘book title’ and the relation ‘has written’.
We assume that we know some related instance-
pairs, e.g. (‘Leo Tolstoy’, ‘War and Peace’) and
(‘Günter Grass’, ‘Die Blechtrommel’). We then
want to find natural language phrases that relate
authors to the titles of the books they wrote. Thus,
if we query a pattern in combination with the name
of an author (e.g. ‘Umberto Eco wrote’), we want
the search results of this query to contain the books
by this author.
The population of an ontology can be seen as
a generalization of a question-answering setting.
Unlike question-answering, we are interested in
finding all possible instance-pairs, not only the
pairs with one fixed instance (e.g. all ‘author’-
‘book’ pairs instead of only the pairs containing
a fixed author). Functional relations in an ontol-
ogy correspond to factoid questions, e.g. the pop-
ulation of the classes ‘person’ and ‘country’ and
the ‘was born in’-relation. Non-functional rela-
tions can be used to identify answers to list ques-
tions, for example “name all books written by
Louis-Ferdinand Céline” or “which countries bor-
der Germany?”.
3 Related work
Brin identifies the use of patterns in the discovery
of relations on the web (Brin, 1998). He describes
a website-dependent approach to identify hyper-
text patterns that express some relation. For each
web site, such patterns are learned and explored
to identify instances that are similarly related. In
(Agichtein and Gravano, 2000), such a system is
combined with a named-entity recognizer.
In (Craven et al., 2000) an ontology is popu-
lated by crawling a website. Based on tagged web
pages from other sites, rules are learned to extract
information from the website.
Research on named-entity recognition was ad-
dressed in the nineties at the Message Understand-
ing Conferences (Chinchor, 1998) and is contin-
ued for example in (Zhou and Su, 2002).
Automated part of speech tagging (Brill, 1992)
is a useful technique in term extraction (Frantzi
et al., 2000), a domain closely related to named-
entity recognition. Here, terms are extracted
with a predefined part-of-speech structure, e.g. an
adjective-noun combination. In (Nenadić et al.,
2002), methods are discussed to extract informa-
tion from natural language texts with the use of
both part of speech tags and hyponym patterns.
As referred to in the introduction, Ravichandran
and Hovy (2002) present a method to identify sur-
face text patterns using a web search engine. They
extract patterns expressing functional relations in
a factoid question answering setting. Selection of
the extracted patterns is based on the precision of
the patterns. For example, if the pattern ‘was born
in’ is identified as a pattern for the pair (‘Mozart’,
‘Salzburg’), they compute precision as the num-
ber of excerpts containing ‘Mozart was born in
Salzburg’ divided by the number of excerpts with
‘Mozart was born in’.
Information extraction and ontology creation
are two closely related fields. For reliable information
extraction, we need background information,
e.g. an ontology. On the other hand, we need information
extraction to generate broad and highly
usable ontologies. An overview of ontology learning
from text can be found in (Buitelaar et al.,
2005).
Early work (Hearst, 1998) describes the extrac-
tion of text patterns expressing WordNet-relations
(such as hyponym relations) from some corpus.
This work focusses merely on the identification of
such text patterns (i.e. phrases containing both in-
stances of some related pair). Patterns found by
multiple pairs are suggested to be usable patterns.
KnowItAll is a hybrid named-entity extraction
system (Etzioni et al., 2005) that finds lists of in-
stances of some class from the web using a search
engine. It combines Hearst patterns and learned
patterns for instances of some class to identify and
extract named-entities. Moreover, it uses adaptive
wrapper algorithms (Crescenzi and Mecca, 2004)
to extract information from html markup such as
tables.
Cimiano and Staab (2004) describe a method to use
a search engine to verify a hypothesized relation.
For example, if we are interested in the ‘is
a’ or hyponym relation and we have a candidate
instance pair (‘river’, ‘Nile’) for this relation, we can
use a search engine to query phrases expressing
this relation (e.g. ‘rivers such as the Nile’). The
number of hits to such queries can then be used as
a measure to determine the validity of the hypoth-
esis.
In (Geleijnse and Korst, 2005), a method is de-
scribed to populate an ontology with the use of
queried text patterns. The algorithm presented ex-
tracts instances from search results after having
submitted a combination of an instance and a pat-
tern as a query to a search engine. The extracted
instances from the retrieved excerpts can there-
after be used to formulate new queries – and thus
identify and extract other instances.
4 The algorithm
We present an algorithm to learn surface text pat-
terns for relations. We use Google™ to retrieve
such patterns.
The algorithm makes use of a training set TR
of instance-pairs that are R-related. This training
set should be chosen such that the instance-pairs are
typical for relation R.
We first discover how relation R is expressed
in natural language texts on the web (Section 4.1).
In Section 4.2 we address the problem of select-
ing effective patterns from the total set of patterns
found.
4.1 Identifying relation patterns
We first generate a list of surface text patterns with
the use of the following algorithm. For evaluation
purposes, we also compute the frequency of each
pattern found.
- Step 1: Formulate queries using an instance-
pair (x,y) ∈ TR. Since we are interested in
phrases within sentences rather than in key-
words or expressions in telegram style that
often appear in titles of webpages, we use
the allintext: option. This gives us only
search results with the queried expression in
the bodies of the documents rather than in the
titles. We query both allintext:" x *
y " and allintext:" y * x ". The *
is a regular expression operator accepted by
Google. It is a placeholder for zero or more
words.
- Step 2: Send the queries to Google and col-
lect the excerpts of the at most 1,000 pages it
returns for each query.
- Step 3: Extract all phrases matching the
queried expressions and replace both x and
y by the names of their classes.
- Step 4: Remove all phrases that are not
within one sentence.
- Step 5: Normalize all phrases by removing
all mark-up that is ignored by Google. Since
Google is case-insensitive and ignores punc-
tuation, double spaces and the like, we trans-
late all phrases found to a normal form: the
simplest expression that we can query that
leads to the document retrieved.
- Step 6: Update the frequencies of all normal-
ized phrases found.
- Step 7: Repeat the procedure for any un-
queried pair (x′,y′) ∈ TR.
We now have generated a list with relation pat-
terns and their frequencies within the retrieved
Google excerpts.
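The seven steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `fetch_excerpts` is a hypothetical callable standing in for the search engine (Steps 1 and 2), the sentence check is approximated by forbidding '.', '!' and '?' between the two instances, and `normalize` is only a rough approximation of the normal form described in Step 5.

```python
import re
from collections import Counter


def build_queries(x, y):
    """Step 1: the two wildcard queries for one training pair."""
    return [f'allintext:"{x} * {y}"', f'allintext:"{y} * {x}"']


def normalize(phrase):
    """Step 5 (simplified): lower-case, drop punctuation, collapse spaces."""
    phrase = re.sub(r"[^\w\s]", " ", phrase.lower())
    return " ".join(phrase.split())


def extract_patterns(excerpts, x, y, class_x, class_y):
    """Steps 3-5: phrases linking x and y, with class names substituted."""
    found = []
    for first, second, c1, c2 in ((x, y, class_x, class_y),
                                  (y, x, class_y, class_x)):
        # The gap may not contain '.', '!' or '?': a crude stand-in for
        # "within one sentence" (Step 4).
        gap = re.compile(re.escape(first) + r"([^.!?]*?)" + re.escape(second),
                         re.IGNORECASE)
        for excerpt in excerpts:
            for match in gap.finditer(excerpt):
                middle = normalize(match.group(1))
                found.append(" ".join(f"({c1}) {middle} ({c2})".split()))
    return found


def learn_patterns(training_pairs, fetch_excerpts, class_x, class_y):
    """Steps 6-7: accumulate pattern frequencies over all training pairs."""
    frequencies = Counter()
    for x, y in training_pairs:
        # fetch_excerpts is an assumed interface: query string -> excerpts.
        excerpts = [e for q in build_queries(x, y) for e in fetch_excerpts(q)]
        frequencies.update(extract_patterns(excerpts, x, y, class_x, class_y))
    return frequencies
```

Passing the search function in as a parameter keeps the sketch independent of any particular search engine interface; the returned `Counter` plays the role of the pattern list with frequencies.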
4.2 Selecting relation patterns
From the list of relation patterns found, we are in-
terested in the most effective ones.
We are not only interested in the most precise
ones. For example, the retrieved pattern “född 30
mars 1853 i” proved to be a 100% precise pattern
expressing the relation between a person (‘Vin-
cent van Gogh’) and his place of birth (‘Zun-
dert’). Clearly, this rare phrase is unsuited to mine
instance-pairs of this relation in general. On the
other hand, high frequency of some pattern is no
guarantee for effectiveness either. The frequently
occurring pattern “was born in London” (found
when querying for Thomas Bayes * England)
is well-suited to be used to find London-born per-
sons, but in general the pattern is unsuited – since
too narrow – to express the relation between a per-
son and his or her country of origin.
Taking these observations into account, we for-
mulate three criteria for selecting effective relation
patterns.
1. The patterns should frequently occur on the
web, to increase the probability of getting any
results when querying the pattern in combi-
nation with an instance.
2. The pattern should be precise. When we
query a pattern in combination with an in-
stance in Iq, we want to have many search
results containing instances from ca.
3. If relation R is not functional, the pattern
should be wide-spread, i.e. among the search
results when querying a combination of the
pattern and an instance in Iq there must be as
many distinct R-related instances from ca as
possible.
To measure these criteria, we use the following
scoring functions for relation patterns s.
1. $f_{freq}(s)$ = “number of occurrences of s in
the excerpts as found by the algorithm described
in the previous subsection”
2. $f_{prec}(s) = \frac{\sum_{x \in I'_q} P(s,x)}{|I'_q|}$, where
for instances $x \in I'_q$, $I'_q \subseteq I_q$, we calculate
P(s,x) as follows.
$P(s,x) = \frac{F_I(s,x)}{F_O(s,x)}$
and
$F_I(s,x)$ = the number of Google excerpts
after querying s in combination with x
containing instances of $c_a$.
$F_O(s,x)$ = the total number of excerpts
found (at most 1,000).
3. $f_{spr}(s) = \sum_{x \in I'_q} B(s,x)$, where
B(s,x) = the number of distinct instances
of class $c_a$ found after querying pattern s in
combination with x.
The larger we choose the test set, the subset
I′q of Iq, the more reliable the measures for pre-
cision and spreading. However, the number of
Google queries increases with the number of pat-
terns found for each instance we add to I′q.
We finally calculate the score of the patterns by
multiplying the individual scores:
$\mathrm{score}(s) = f_{freq}(s) \cdot f_{prec}(s) \cdot f_{spr}(s)$
For efficiency reasons, we only compute the
scores of the patterns with the highest frequencies.
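As a small worked illustration of the three scoring functions, the sketch below computes the score of a single pattern from per-instance counts. The function and argument names are chosen here for illustration, the numbers in the usage lines are invented, and treating P(s,x) as 0 when a query returns no excerpts is an assumption, since the text does not define that case.

```python
def f_prec(in_class_counts, total_counts):
    """Average of FI(s,x) / FO(s,x) over the test instances x."""
    ratios = [
        fi / fo if fo else 0.0  # assumption: no excerpts means P(s,x) = 0
        for fi, fo in zip(in_class_counts, total_counts)
    ]
    return sum(ratios) / len(ratios)


def f_spr(distinct_counts):
    """Sum over the test instances x of B(s,x)."""
    return sum(distinct_counts)


def score(freq, in_class_counts, total_counts, distinct_counts):
    """Product of frequency, precision and spread for one pattern."""
    return freq * f_prec(in_class_counts, total_counts) * f_spr(distinct_counts)


# Invented counts for one pattern and a test set of two instances.
example = score(freq=10, in_class_counts=[5, 2], total_counts=[10, 8],
                distinct_counts=[3, 1])
print(example)
```

Because the three factors are multiplied, a pattern that scores zero on any one criterion is ranked last regardless of the other two.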
The problem remains how to recognize a (possibly
multi-word) instance in the Google excerpts.
For an ontology alignment setting – where the sets
Ia and Iq are not to be expanded – this problem
is trivial: we determine whether t ∈ Ia is accompanied
by the queried expression. For a setting
where the instances of ca are not all known (e.g.
it is not likely that we have a complete list of all
books written in the world), we solve this problem
in two stages. First we identify rules per class to
extract candidate instances. Thereafter we use an
additional Google query to verify if a candidate is
indeed an instance of class ca.
Identifying a candidate instance
The identification of multi-word terms is an issue
of research on its own. However, in this setting
we can allow ourselves to use less elaborate tech-
niques to identify candidate instances. We can do
so, since we additionally perform a check on each
extracted term. So, per class we create rules to
identify candidate instances with a focus on high
recall. In our current experiments we thus use very
simple term recognition rules, based on regular expressions.
For example, we identify a candidate
instance of class ‘person’ if the queried expression
is accompanied by two or three capitalized words.
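A rule of this kind might look as follows. This is only a sketch of one possible regular expression: it considers terms that directly follow the queried expression, and it recognizes only ASCII capitals, both of which are simplifying assumptions made here rather than details given in the text.

```python
import re

# One capitalized word: an ASCII capital followed by letters, digits, ' or -.
CAP_WORD = r"[A-Z][\w'-]*"
# Two or three capitalized words separated by single whitespace characters.
PERSON = rf"{CAP_WORD}(?:\s{CAP_WORD}){{1,2}}"


def person_candidates(excerpt, queried_expression):
    """Return 2-3 word capitalized terms that directly follow the expression."""
    pattern = re.compile(re.escape(queried_expression) + r"\s+(" + PERSON + r")")
    return [match.group(1) for match in pattern.finditer(excerpt)]
```

Such a rule favours recall over precision: any run of capitalized words is returned as a candidate, and the later instance-class check is relied upon to reject the false ones.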
Identifying an instance-class relation
We are interested in the question whether some
extracted term t is an instance of class ca. For ex-
ample, given the term ‘The Godfather’, does this
term belong to the class ‘movie’? The instance-
class relation can be viewed as a hyponym re-
lation. We therefore verify the hypothesis of t be-
ing an instance of ca by Googling hyponym rela-
tion patterns. We use a fixed set H of common
patterns expressing the hyponym relation (Hearst,
1992; Cimiano and Staab, 2004), see Table 1. For
the class names, we use plurals.
"cq including t and"
"cq for example t and"
"cq like t and"
"cq such as t and"
Table 1: Hearst patterns for instance-class relation.
We use these patterns in the following acceptance
function
$\mathrm{accept}_{c_q}(t) := \left( \sum_{p \in H} h(p, c_q, t) \geq n \right)$,
where h(p,cq,t) is the number of Google hits for
query with pattern p combined with term t and the
plural form of the class name cq. The threshold
n has to be chosen beforehand. We can do so, by
calculating the sum of Google hits for queries with
known instances of the class. Based on these figures,
a threshold can be chosen, e.g. the minimum
of these sums.
Note that term t is both preceded and followed
by a fixed phrase in the queries. We do so to
guarantee that t is indeed the full term we are in-
terested in. For example, if we had extracted the
term ‘Los’ instead of ‘Los Angeles’ as a Californian
City, we would falsely identify ‘Los’ as a Californian
City if we did not let ‘Los’ be followed by
the fixed expression ‘and’. The number of Google
hits for some expression x is at least the number
of Google hits when querying the same expression
followed by some expression y.
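The acceptance function and the choice of threshold can be sketched as below. The four templates mirror the patterns of Table 1; `hits` is a hypothetical callable that returns the number of search engine hits for a phrase query, and all hit counts in the usage lines are invented for illustration.

```python
# The fixed set H of hyponym patterns; the term is enclosed on both sides.
HEARST_TEMPLATES = (
    "{cls} including {term} and",
    "{cls} for example {term} and",
    "{cls} like {term} and",
    "{cls} such as {term} and",
)


def hit_sum(term, plural_class, hits):
    """Sum of hit counts over all patterns in H for one term."""
    return sum(
        hits(template.format(cls=plural_class, term=term))
        for template in HEARST_TEMPLATES
    )


def choose_threshold(known_instances, plural_class, hits):
    """One option named in the text: the minimum sum over known instances."""
    return min(hit_sum(term, plural_class, hits) for term in known_instances)


def accept(term, plural_class, hits, n):
    """True if the summed hit count reaches the threshold n."""
    return hit_sum(term, plural_class, hits) >= n


# Invented hit counts standing in for a search engine.
fake_counts = {
    "movies such as The Godfather and": 40,
    "movies like The Godfather and": 25,
    "movies like Casablanca and": 30,
    "movies including Casablanca and": 12,
}
fake_hits = lambda query: fake_counts.get(query, 0)
n = choose_threshold(["The Godfather", "Casablanca"], "movies", fake_hits)
print(n, accept("The Godfather", "movies", fake_hits, n))
```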
If we identify a term t as being an instance of
class ca, we can add this term to the set Ia. How-
ever, we cannot relate t to an instance in Iq, since
the pattern used to find t has not proven to be effective
yet (e.g. the pattern could express a different
relation between one of the instance-pairs in the
training set).
We reduce the number of Google queries by us-
ing a list of terms found that do not belong to ca.
Terms that occur multiple times in the excerpts
can then be checked only once. Moreover, we use
the OR-clause to combine the individual queries
into one. We then check if the number of hits
to this query exceeds the threshold. The number
of Google queries in this phase thus equals the
number of distinct terms extracted.
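The two savings described here, one combined OR query per term and no repeated checks, can be sketched as follows. The class and its names are hypothetical, `hits` is again an assumed callable returning a hit count for a query string, and the sketch caches every verdict rather than only the rejected terms. It uses `>=` to stay consistent with the acceptance function.

```python
def combined_query(term, plural_class, templates):
    """Join the individual hyponym-pattern queries with the OR operator."""
    return " OR ".join(
        '"' + t.format(cls=plural_class, term=term) + '"' for t in templates
    )


class InstanceChecker:
    """Checks each distinct term at most once, using one combined query."""

    def __init__(self, plural_class, templates, hits, threshold):
        self.plural_class = plural_class
        self.templates = templates
        self.hits = hits            # hypothetical: query string -> hit count
        self.threshold = threshold
        self.verdicts = {}          # term -> bool, filled on first check
        self.queries_sent = 0

    def is_instance(self, term):
        if term not in self.verdicts:
            query = combined_query(term, self.plural_class, self.templates)
            self.queries_sent += 1
            self.verdicts[term] = self.hits(query) >= self.threshold
        return self.verdicts[term]
```

With this structure the value of `queries_sent` equals the number of distinct terms checked, which is the query count stated in the text.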
5 The use of surface text patterns in
information extraction
Having a method to identify relation patterns, we
now focus on utilizing these patterns in informa-
tion extraction from texts found by a search en-
gine. We use an ontology to represent the infor-
mation extracted.
Suppose we have an ontology O with classes
(c1,c2,...) and corresponding instance sets
(I1,I2,...). On these classes, relations R(i,j)¹
are defined, with i and j the index numbers of
the classes. The non-empty sets T(i,j) contain
the training set of instance-pairs of the relations
R(i,j).
Per instance, we maintain a list of expressions
that already have been used as a query. Initially,
these are empty.
The first step of the algorithm is to learn surface
text patterns for each relation in O.
The following steps of the algorithm are per-
formed until either some stop criterion is reached,
or no more new instances and instance-pairs can
be found.
- Step 1: Select a relation R(i,j), and an in-
stance v from either Ii or Ij such that there
exists at least one pattern expressing R(i,j)
we have not yet queried in combination with
v.
- Step 2: Construct queries using the patterns
with v and send these queries to Google.
- Step 3: Extract instances from the excerpts.
- Step 4: Add the newly found instances to
the corresponding instance set and add the
instance-pairs found (thus with v) to T(i,j).
- Step 5: If there exists an instance that we can
use to formulate new queries, then repeat the
procedure.
Else, learn new patterns using the extracted
instance-pairs and then repeat the procedure.
Note that instances of class cx learned using the
algorithm applied on relation R(x,y) can be used
as input for the algorithm applied to some relation
R(x,z) to populate the sets Iz and T(x,z).
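For a single relation between two classes, the loop of Steps 1 to 4 can be sketched as below. This is a simplified reading, not the authors' implementation: `search` and `extract` are hypothetical callables for the search engine and for instance extraction, patterns are written as format strings with one slot for the instance, `max_queries` is an assumed stop criterion, and the re-learning of patterns in Step 5 is left out.

```python
def populate(I_i, I_j, patterns_i, patterns_j, search, extract,
             max_queries=1000):
    """Grow both instance sets and the set of instance-pairs by querying."""
    instances = {"i": set(I_i), "j": set(I_j)}
    patterns = {"i": list(patterns_i), "j": list(patterns_j)}
    other = {"i": "j", "j": "i"}
    pairs = set()
    used = set()       # (side, instance, pattern) combinations already queried
    queries = 0
    progress = True
    while progress and queries < max_queries:
        progress = False
        for side in ("i", "j"):
            # Step 1: pick instances with a pattern not yet queried.
            for v in sorted(instances[side]):
                for p in patterns[side]:
                    if (side, v, p) in used or queries >= max_queries:
                        continue
                    used.add((side, v, p))
                    queries += 1
                    progress = True
                    # Step 2: build the query and send it.
                    query = p.format(v)
                    # Step 3: extract instances of the other class.
                    for w in extract(search(query), query, other[side]):
                        # Step 4: store the instance and the pair with v.
                        instances[other[side]].add(w)
                        pairs.add((v, w) if side == "i" else (w, v))
    return instances["i"], instances["j"], pairs
```

The loop stops when a full pass adds no new query, which corresponds to the case where no more new instances and instance-pairs can be found.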
6 Experiments
In this section, we discuss two experiments that we
have conducted. The first experiment involves the
identification of effective hyponym patterns. The
second experiment is an illustration of the applica-
tion of learned surface text patterns in information
extraction.
¹ Assuming one relation per pair of classes. We can use
another index k in R(i,j,k) to distinguish multiple relations be-
tween ci and cj.
6.1 Learning effective hyponym patterns
We are interested in whether the effective surface text
patterns are indeed intuitive formulations of some
relation R. As a test-case, we compute the most
effective patterns for the hyponym relation using a
test set with names of all countries.
Our experiment was set up as follows. We col-
lected the complete list of countries in the world
from the CIA World Factbook². Let Iq be this set
of countries, and let Ia be the set { ‘countries’,
‘country’ }. The set TR consists of all pairs (a,
‘countries’) and (a, ‘country’), for a ∈ Iq. We
apply the surface text pattern learning algorithm
on this set TR.
The algorithm identified almost 40,000 patterns.
We computed fspr and fprec for the 1,000 most
frequently found patterns. In Table 2, we give the
25 most effective patterns found by the algorithm.
We consider the patterns in boldface true hyponym
patterns. Focussing on these patterns, we observe
two groups: ‘is a’ and Hearst-like patterns.
pattern freq prec spr
(countries) like 645 0.66 134
(countries) such as 537 0.54 126
is a small (country) 142 0.69 110
(country) code for 342 0.36 84
(country) map of 345 0.34 78
(countries) including 430 0.21 93
is the only (country) 138 0.55 102
is a (country) 339 0.22 99
(country) flag of 251 0.63 46
and other (countries) 279 0.34 72
and neighboring (countries) 164 0.43 92
(country) name republic of 83 0.93 76
(country) book of 59 0.77 118
is a poor (country) 63 0.73 106
is the first (country) 53 0.70 112
(countries) except 146 0.37 76
(country) code for calling 157 0.95 26
is an independent (country) 62 0.55 114
and surrounding (countries) 84 0.40 107
is one of the poorest (countries) 61 0.75 78
and several other (countries) 65 0.59 90
among other (countries) 84 0.38 97
is a sovereign (country) 48 0.69 89
or any other (countries) 87 0.58 58
(countries) namely 58 0.44 109
Table 2: Learned hyponym patterns and their
scores.
The Hearst-patterns ‘like’ and ‘such as’ prove to
be the most effective. This observation is useful
when we want to minimize the number of queries
for hyponym patterns.
Expressions of properties that hold for each
country and only for countries, for example the existence
of a country code for dialing, are not trivially
identified manually but are useful and reliable
patterns.
² http://www.cia.gov/cia/publications/factbook
The combination of ‘is a’, ‘is an’ or ‘is the’ with
an adjective is a common pattern, occurring 2,400
times in the list. In future work, we plan to identify
such adjectives in Google excerpts using a Part of
Speech tagger (Brill, 1992).
6.2 Applying learned patterns in information
extraction
The Text Retrieval Conference (TREC) question
answering track in 2004 contains list questions,
for example ‘Who are Nirvana’s band members?’
(Voorhees, 2004). We illustrate the use of our on-
tology population algorithm in the context of such
list-question answering with a small case-study.
Note that we do not consider the processing of the
question itself in this research.
Inspired by one of the questions (‘What countries
is Burger King located in?’), we are interested
in populating an ontology with restaurants and the
countries in which they operate. We identify the
classes ‘country’ and ‘restaurant’ and the relation
‘located in’ between the classes.
We hand the algorithm the instances of ‘coun-
try’, as well as two instances of ‘restaurant’: ‘Mc-
Donald’s’ and ‘KFC’. Moreover, we add three
instance-pairs of the relation to the algorithm. We
use these pairs and a subset I′country of size eight
to compute a ranked list of the patterns. We ex-
tract terms consisting of one up to four capital-
ized words. In this test we set the threshold for
the number of Google results for the queries with
the extracted terms to 50. After a small test with
names of international restaurant branches, this
seemed an appropriate threshold.
The algorithm learned, besides a ranked list of
170 surface text patterns (Table 3), a list of 54 in-
stances of restaurant (Table 4). Among these in-
stances are indeed the names of large international
chains, Burger King being one of them. Less
expected are the names of geographic locations
and names of famous cuisines such as ‘Chinese’
and ‘French’. The last category of false instances
found that have not been filtered out consists of a number of
very common words (e.g. ‘It’ and ‘There’).
We populate the ontology with relations found
between Burger King and instances from country
using the 20 most effective patterns.
pattern prec spr freq
ca restaurants of cq 0.24 15 21
ca restaurants in cq 0.07 19 9
ca hamburger chain that occupies villages throughout modern day cq 1.0 1 7
ca restaurant in cq 0.06 16 6
ca restaurants in the cq 0.13 16 2
ca hamburger restaurant in southern cq 1.0 1 4
Table 3: Top learned patterns for the restaurant-
country (ca - cq) relation.
Chinese    Bank        Outback Steakhouse
Denny’s    Pizza Hut   Kentucky Fried Chicken
Subway     Taco Bell   Continental
Holywood   Wendy’s     Long John Silver’s
HOTEL OR   This        Burger King
Japanese   West        Keg Steakhouse
You        BP          Outback
World      Brazil      San Francisco
Leo        Victoria    New York
These      Lyons       Starbucks
FELIX      Roy         California Pizza Kitchen
Marks      Cities      Emperor
Friendly   Harvest     Friday
New York   Vienna      Montana
Louis XV   Greens      Red Lobster
Good       It          There
That       Mark        Dunkin Donuts
Italia     French      Tim Hortons
Table 4: Learned instances for restaurant.
The algorithm returned 69 instance-pairs with
countries related to ‘Burger King’. On the Burger
King website³ a list of the 65 countries can be
found in which the hamburger chain operates. Of
these 65 countries, we identified 55. This implies
that our results have a precision of 55/69 = 80% and
recall of 55/65 = 85%. Many of the falsely related
countries – mostly in eastern Europe – are loca-
tions where Burger King is said to have plans to
expand its empire.
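The two figures follow directly from the counts reported above: 55 correct countries among 69 returned, and 55 found out of the 65 listed. A minimal check of the arithmetic, with a helper function named here for illustration:

```python
def precision_recall(n_correct, n_returned, n_relevant):
    """Precision = correct / returned; recall = correct / relevant."""
    return n_correct / n_returned, n_correct / n_relevant


precision, recall = precision_recall(n_correct=55, n_returned=69, n_relevant=65)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints: precision = 80%, recall = 85%
```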
7 Conclusions
We have presented a novel approach to identify
useful surface text patterns for information extrac-
tion using an internet search engine. We argued
that the selection of patterns has to be based on
effectiveness: a pattern has to occur frequently, it
has to be precise and has to be wide-spread if it
represents a non-functional relation.
These criteria are combined in a scoring func-
tion which we use to select the most effective pat-
terns.
³ http://www.whopper.com
The method presented can be used for arbitrary
relations, thus also relations that link an instance
to multiple other instances. These patterns can be
used in information extraction. We combine pat-
terns with an instance and offer such an expression
as a query to a search engine. From the excerpts
retrieved, we extract instances and simultaneously
instance-pairs.
Learning surface text patterns is efficient with
respect to the number of queries if we know all
instances of the classes concerned. The first part
of the algorithm is linear in the size of the training
set. Furthermore, we select the n most frequent
patterns and perform |I′q| · n queries to compute
the score of these n patterns.
However, for a setting where I′a is incomplete,
we have to perform a check for each unique term
identified as a candidate instance in the excerpts
found by the |I′q| · n queries. The number of
queries, one for each extracted unique candidate
instance, thus fully depends on the rules that are
used to identify a candidate instance.
We apply the learned patterns in an ontology
population algorithm. We combine the learned
high quality relation patterns with an instance in
a query. In this way we can perform a range of ef-
fective queries to £nd instances of some class and
simultaneously £nd instance-pairs of the relation.
A first experiment, the identification of hyponym
patterns, showed that the patterns identified
indeed intuitively reflect the relation consid-
ered. Moreover, we have generated a ranked list
of hyponym patterns. The experiment with the
restaurant ontology illustrated that a small train-
ing set suffices to learn effective patterns and pop-
ulate an ontology with good precision and recall.
The algorithm performs well with respect to re-
call of the instances found: many big international
restaurant branches were found. The identification
of the instances, however, is open to improvement,
since the additional check does not filter out all
falsely identified candidate instances.
8 Future work
Currently we check whether an extracted term is
indeed an instance of some class by querying hy-
ponym patterns. However, if we find two in-
stances related by some surface text pattern, we al-
ways accept these instances as an instance-pair. Thus,
if we find both ‘Mozart was born in Germany’
and ‘Mozart was born in Austria’, both extracted
instance-pairs are added to our ontology. We
thus need some post-processing to remove falsely
found instance-pairs. When we know that a re-
lation is functional, we can select the most fre-
quently occurring instance-pair.
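For a functional relation, this selection step could be sketched as follows. The function name and the counts in the usage lines are invented for illustration, and breaking ties in favour of the pair seen first is an assumption of the sketch.

```python
from collections import Counter


def most_frequent_pairs(extracted_pairs):
    """Keep, for each left-hand instance, its most frequently extracted pair."""
    counts = Counter(extracted_pairs)
    best = {}
    for (x, y), count in counts.items():
        # Replace the current choice only if this pair is strictly more
        # frequent, so ties keep the pair that was seen first.
        if x not in best or count > counts[(x, best[x])]:
            best[x] = y
    return set(best.items())


# Invented extraction counts for illustration.
extracted = (
    [("Mozart", "Austria")] * 3
    + [("Mozart", "Germany")]
    + [("Bach", "Germany")] * 2
)
print(sorted(most_frequent_pairs(extracted)))
```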
Moreover, the process of identifying an instance
in a text needs further research especially since
the method to identify instance-class relations by
querying hyponym patterns is not flawless.
The challenge thus lies in the area of improving
the precision of the output of the ontology pop-
ulation algorithm. With additional filtering tech-
niques and more elaborate identification tech-
niques we expect to be able to improve the pre-
cision of the output. We plan to research check
functions based on enumerations of candidate in-
stances with known instances of the class. For ex-
ample, the enumeration ‘KFC, Chinese and Mc-
Donald’s’ is not found by Google, where ‘KFC,
Burger King and McDonald’s’ gives 31 hits.
Our experiment with the extraction of hyponym
patterns suggests a ranking of Hearst-patterns
based on their effectiveness. Knowledge of the
effectiveness of each of the Hearst-patterns can be
utilized to minimize the number of queries.
Finally we will investigate ways to compare our
methods with other systems in a TREC-like setting
with the web as a corpus.
Acknowledgments
We thank our colleagues Bart Bakker and Dragan
Sekulovski and the anonymous reviewers for their
useful comments on earlier versions of this paper.

References
E. Agichtein and L. Gravano. 2000. Snowball: Ex-
tracting relations from large plain-text collections.
In Proceedings of the Fifth ACM International Con-
ference on Digital Libraries.
E. Brill. 1992. A simple rule-based part-of-speech
tagger. In Proceedings of the third Conference on
Applied Natural Language Processing (ANLP’92),
pages 152–155, Trento, Italy.
S. Brin. 1998. Extracting patterns and relations from
the world wide web. In WebDB Workshop at sixth
International Conference on Extending Database
Technology (EDBT’98).
P. Buitelaar, P. Cimiano, and B. Magnini, editors.
2005. Ontology Learning from Text: Methods, Eval-
uation and Applications, volume 123 of Frontiers in
Artificial Intelligence and Applications. IOS Press.
N. A. Chinchor, editor. 1998. Proceedings of the Sev-
enth Message Understanding Conference (MUC-7).
Morgan Kaufmann, Fairfax, Virginia.
P. Cimiano and S. Staab. 2004. Learning by googling.
SIGKDD Explorations Newsletter, 6(2):24–33.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum,
T. Mitchell, K. Nigam, and S. Slattery. 2000. Learn-
ing to construct knowledge bases from the World
Wide Web. Artificial Intelligence, 118:69–113.
V. Crescenzi and G. Mecca. 2004. Automatic infor-
mation extraction from large websites. Journal of
the ACM, 51(5):731–779.
O. Etzioni, M. J. Cafarella, D., A. Popescu, T. Shaked,
S. Soderland, D. S. Weld, and A. Yates. 2005. Un-
supervised named-entity extraction from the web:
An experimental study. Artificial Intelligence,
165(1):91–134.
K. Frantzi, S. Ananiadou, and H. Mima. 2000. Au-
tomatic recognition of multi-word terms: the c-
value/nc-value method. International Journal on
Digital Libraries, 3:115–130.
G. Geleijnse and J. Korst. 2005. Automatic ontology
population by googling. In Proceedings of the Sev-
enteenth Belgium-Netherlands Conference on Artificial
Intelligence (BNAIC 2005), pages 120–126,
Brussels, Belgium.
M. Hearst. 1992. Automatic acquisition of hyponyms
from large text corpora. In Proceedings of the
14th conference on Computational linguistics, pages
539–545, Morristown, NJ, USA.
M. Hearst. 1998. Automated discovery of wordnet
relations. In Christiane Fellbaum, editor, WordNet:
An Electronic Lexical Database. MIT Press, Cam-
bridge, MA.
G. Nenadić, I. Spasić, and S. Ananiadou. 2002. Au-
tomatic discovery of term similarities using pattern
mining. In Proceedings of the second international
workshop on Computational Terminology (CompuT-
erm’02), Taipei, Taiwan.
D. Ravichandran and E. Hovy. 2002. Learning surface
text patterns for a question answering system. In
Proceedings of the 40th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL 2002),
pages 41–47, Philadelphia, PA.
E. Voorhees. 2004. Overview of the TREC 2004 ques-
tion answering track. In Proceedings of the 13th
Text Retrieval Conference (TREC 2004), Gaithers-
burg, Maryland.
G. Zhou and J. Su. 2002. Named entity recognition
using an HMM-based chunk tagger. In Proceedings
of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL 2002), pages 473–480,
Philadelphia, PA.
