Scaling Web-based Acquisition of Entailment Relations
Idan Szpektordiamondmath
idan@szpektor.net
†ITC-Irst, Via Sommarive, 18 (Povo) - 38050 Trento, Italy
‡DIT - University of Trento, Via Sommarive, 14 (Povo) - 38050 Trento, Italy
◦Department of Computer Science, Bar Ilan University - Ramat Gan 52900, Israel
diamondmathDepartment of Computer Science, Tel Aviv University - Tel Aviv 69978, Israel
Hristo Tanev†
tanev@itc.it
Ido Dagan◦
dagan@cs.biu.ac.il
Bonaventura Coppola†‡
coppolab@itc.it
Abstract
Paraphrase recognition is a critical step for nat-
ural language interpretation. Accordingly, many
NLP applications would benefit from high coverage
knowledge bases of paraphrases. However, the scal-
ability of state-of-the-art paraphrase acquisition ap-
proaches is still limited. We present a fully unsuper-
vised learning algorithm for Web-based extraction
of entailment relations, an extended model of para-
phrases. We focus on increased scalability and gen-
erality with respect to prior work, eventually aiming
at a full scale knowledge base. Our current imple-
mentation of the algorithm takes as its input a verb
lexicon and for each verb searches the Web for re-
lated syntactic entailment templates. Experiments
show promising results with respect to the ultimate
goal, achieving much better scalability than prior
Web-based methods.
1 Introduction
Modeling semantic variability in language has
drawn a lot of attention in recent years. Many ap-
plications like QA, IR, IE and Machine Translation
(Moldovan and Rus, 2001; Hermjakob et al., 2003;
Jacquemin, 1999) have to recognize that the same
meaning can be expressed in the text in a huge vari-
ety of surface forms. Substantial research has been
dedicated to acquiring paraphrase patterns, which
represent various forms in which a certain meaning
can be expressed.
Following (Dagan and Glickman, 2004) we ob-
serve that a somewhat more general notion needed
for applications is that of entailment relations (e.g.
(Moldovan and Rus, 2001)). These are directional
relations between two expressions, where the mean-
ing of one can be entailed from the meaning of the
other. For example “X acquired Y” entails “X owns
Y”. These relations provide a broad framework for
representing and recognizing semantic variability,
as proposed in (Dagan and Glickman, 2004). For
example, if a QA system has to answer the question
“Who owns Overture?” and the corpus includes the
phrase “Yahoo acquired Overture”, the system can
use the known entailment relation to conclude that
this phrase really indicates the desired answer. More
examples of entailment relations, acquired by our
method, can be found in Table 1 (section 4).
To perform such inferences at a broad scale, ap-
plications need to possess a large knowledge base
(KB) of entailment patterns. We estimate such a
KB should contain from between a handful to a few
dozens of relations per meaning, which may sum
to a few hundred thousands of relations for a broad
domain, given that a typical lexicon includes tens of
thousands of words.
Our research goal is to approach unsupervised ac-
quisition of such a full scale KB. We focus on de-
veloping methods that acquire entailment relations
from the Web, the largest available resource. To
this end substantial improvements are needed in or-
der to promote scalability relative to current Web-
based approaches. In particular, we address two
major goals: reducing dramatically the complexity
of required auxiliary inputs, thus enabling to apply
the methods at larger scales, and generalizing the
types of structures that can be acquired. The algo-
rithms described in this paper were applied for ac-
quiring entailment relations for verb-based expres-
sions. They successfully discovered several rela-
tions on average per each randomly selected expres-
sion.
2 Background and Motivations
This section provides a qualitative view of prior
work, emphasizing the perspective of aiming at a
full-scale paraphrase resource. As there are still
no standard benchmarks, current quantitative results
are not comparable in a consistent way.
The major idea in paraphrase acquisition is often
to find linguistic structures, here termed templates,
that share the same anchors. Anchors are lexical
elements describing the context of a sentence. Tem-
plates that are extracted from different sentences
and connect the same anchors in these sentences,
are assumed to paraphrase each other. For example,
the sentences “Yahoo bought Overture” and “Yahoo
acquired Overture” share the anchors {X=Yahoo,
Y =Overture}, suggesting that the templates ‘X buy
Y’ and ‘X acquire Y’ paraphrase each other. Algo-
rithms for paraphrase acquisition address two prob-
lems: (a) finding matching anchors and (b) identify-
ing template structure, as reviewed in the next two
subsections.
2.1 Finding Matching Anchors
The prominent approach for paraphrase learning
searches sentences that share common sets of mul-
tiple anchors, assuming they describe roughly the
same fact or event. To facilitate finding many
matching sentences, highly redundant comparable
corpora have been used. These include multiple
translations of the same text (Barzilay and McKe-
own, 2001) and corresponding articles from multi-
ple news sources (Shinyama et al., 2002; Pang et
al., 2003; Barzilay and Lee, 2003). While facilitat-
ing accuracy, we assume that comparable corpora
cannot be a sole resource due to their limited avail-
ability.
Avoiding a comparable corpus, (Glickman and
Dagan, 2003) developed statistical methods that
match verb paraphrases within a regular corpus.
Their limited scale results, obtaining several hun-
dred verb paraphrases from a 15 million word cor-
pus, suggest that much larger corpora are required.
Naturally, the largest available corpus is the Web.
Since exhaustive processing of the Web is not feasi-
ble, (Duclaye et al., 2002) and (Ravichandran and
Hovy, 2002) attempted bootstrapping approaches,
which resemble the mutual bootstrapping method
for Information Extraction of (Riloff and Jones,
1999). These methods start with a provided known
set of anchors for a target meaning. For example,
the known anchor set{Mozart, 1756}is given as in-
put in order to find paraphrases for the template ‘X
born in Y’. Web searching is then used to find occur-
rences of the input anchor set, resulting in new tem-
plates that are supposed to specify the same relation
as the original one (“born in”). These new templates
are then exploited to get new anchor sets, which
are subsequently processed as the initial {Mozart,
1756}. Eventually, the overall procedure results in
an iterative process able to induce templates from
anchor sets and vice versa.
The limitation of this approach is the requirement
for one input anchor set per target meaning. Prepar-
ing such input for all possible meanings in broad
domains would be a huge task. As will be explained
below, our method avoids this limitation by find-
ing all anchor sets automatically in an unsupervised
manner.
Finally, (Lin and Pantel, 2001) present a notably
different approach that relies on matching sepa-
rately single anchors. They limit the allowed struc-
ture of templates only to paths in dependency parses
connecting two anchors. The algorithm constructs
for each possible template two feature vectors, rep-
resenting its co-occurrence statistics with the two
anchors. Two templates with similar vectors are
suggested as paraphrases (termed inference rule).
Matching of single anchors relies on the gen-
eral distributional similarity principle and unlike the
other methods does not require redundancy of sets
of multiple anchors. Consequently, a much larger
number of paraphrases can be found in a regular
corpus. Lin and Pantel report experiments for 9
templates, in which their system extracted 10 cor-
rect inference rules on average per input template,
from 1GB of news data. Yet, this method also suf-
fers from certain limitations: (a) it identifies only
templates with pre-specified structures; (b) accuracy
seems more limited, due to the weaker notion of
similarity; and (c) coverage is limited to the scope
of an available corpus.
To conclude, several approaches exhaustively
process different types of corpora, obtaining vary-
ing scales of output. On the other hand, the Web is
a huge promising resource, but current Web-based
methods suffer serious scalability constraints.
2.2 Identifying Template Structure
Paraphrasing approaches learn different kinds of
template structures. Interesting algorithms are pre-
sented in (Pang et al., 2003; Barzilay and Lee,
2003). They learn linear patterns within similar con-
texts represented as finite state automata. Three
classes of syntactic template learning approaches
are presented in the literature: learning of predicate
argument templates (Yangarber et al., 2000), learn-
ing of syntactic chains (Lin and Pantel, 2001) and
learning of sub-trees (Sudo et al., 2003). The last
approach is the most general with respect to the tem-
plate form. However, its processing time increases
exponentially with the size of the templates.
As a conclusion, state of the art approaches still
learn templates of limited form and size, thus re-
stricting generality of the learning process.
3 The TE/ASE Acquisition Method
Motivated by prior experience, we identify two ma-
jor goals for scaling Web-based acquisition of en-
tailment relations: (a) Covering the broadest pos-
sible range of meanings, while requiring minimal
input and (b) Keeping template structures as gen-
eral as possible. To address the first goal we re-
quire as input only a phrasal lexicon of the rel-
evant domain (including single words and multi-
word expressions). Broad coverage lexicons are
widely available or may be constructed using known
term acquisition techniques, making it a feasible
and scalable input requirement. We then aim to
acquire entailment relations that include any of the
lexicon’s entries. The second goal is addressed by a
novel algorithm for extracting the most general tem-
plates being justified by the data.
For each lexicon entry, denoted a pivot, our
extraction method performs two phases: (a) ex-
tract promising anchor sets for that pivot (ASE,
Section 3.1), and (b) from sentences contain-
ing the anchor sets, extract templates for which
an entailment relation holds with the pivot (TE,
Section 3.2). Examples for verb pivots are:
‘acquire’, ‘fall to’, ‘prevent’. We will use the pivot
‘prevent’ for examples through this section.
Before presenting the acquisition method we first
define its output. A template is a dependency parse-
tree fragment, with variable slots at some tree nodes
(e.g. ‘X subj← prevent obj→Y’). An entailment rela-
tion between two templates T1 and T2 holds if
the meaning of T2 can be inferred from the mean-
ing of T1 (or vice versa) in some contexts, but
not necessarily all, under the same variable instan-
tiation. For example, ‘X subj← prevent obj→Y’ entails
‘X subj← reduce obj→Y risk’ because the sentence “as-
pirin reduces heart attack risk” can be inferred from
“aspirin prevents a first heart attack”. Our output
consists of pairs of templates for which an entail-
ment relation holds.
3.1 Anchor Set Extraction (ASE)
The goal of this phase is to find a substantial num-
ber of promising anchor sets for each pivot. A good
anchor-set should satisfy a proper balance between
specificity and generality. On one hand, an anchor
set should correspond to a sufficiently specific set-
ting, so that entailment would hold between its dif-
ferent occurrences. On the other hand, it should be
sufficiently frequent to appear with different entail-
ing templates.
Finding good anchor sets based on just the input
pivot is a hard task. Most methods identify good re-
peated anchors “in retrospect”, that is after process-
ing a full corpus, while previous Web-based meth-
ods require at least one good anchor set as input.
Given our minimal input, we needed refined crite-
ria that identify a priori the relatively few promising
anchor sets within a sample of pivot occurrences.
ASE ALGORITHM STEPS:
For each pivot (a lexicon entry)
1. Create a pivot template, Tp
2. Construct a parsed sample corpus S for Tp:
(a) Retrieve an initial sample from the Web
(b) Identify associated phrases for the pivot
(c) Extend S using the associated phrases
3. Extract candidate anchor sets from S:
(a) Extract slot anchors
(b) Extract context anchors
4. Filter the candidate anchor sets:
(a) by absolute frequency
(b) by conditional pivot probability
Figure 1: Outline of the ASE algorithm.
The ASE algorithm (presented in Figure 1) per-
forms 4 main steps.
STEP (1) creates a complete template, called the
pivot template and denoted Tp, for the input pivot,
denoted P. Variable slots are added for the ma-
jor types of syntactic relations that interact with P,
based on its syntactic type. These slots enable us to
later match Tp with other templates. For verbs, we
add slots for a subject and for an object or a modifier
(e.g. ‘X subj← prevent obj→Y’).
STEP (2) constructs a sample corpus, denoted S,
for the pivot template. STEP (2.A) utilizes a Web
search engine to initialize S by retrieving sentences
containing P. The sentences are parsed by the
MINIPAR dependency parser (Lin, 1998), keeping
only sentences that contain the complete syntactic
template Tp (with all the variables instantiated).
STEP (2.B) identifies phrases that are statistically
associated with Tp in S. We test all noun-phrases
in S , discarding phrases that are too common on
the Web (absolute frequency higher than a thresh-
old MAXPHRASEF), such as “desire”. Then we se-
lect the N phrases with highest tf·idf score1. These
phrases have a strong collocation relationship with
the pivot P and are likely to indicate topical (rather
than anecdotal) occurrences of P. For example, the
phrases “patient” and “American Dental Associa-
tion”, which indicate contexts of preventing health
problems, were selected for the pivot ‘prevent’. Fi-
1Here, tf·idf = freqS(X) · log
parenleftBig
N
freqW(X)
parenrightBig
where freqS(X) is the number of occurrences in S containing
X, N is the total number of Web documents, and freqW(X)
is the number of Web documents containing X.
nally, STEP (2.C) expands S by querying the Web
with the both P and each of the associated phrases,
adding the retrieved sentences to S as in step (2.a).
STEP (3) extracts candidate anchor sets for Tp.
From each sentence in S we try to generate one can-
didate set, containing noun phrases whose Web fre-
quency is lower than MAXPHRASEF. STEP (3.A)
extracts slot anchors – phrases that instantiate the
slot variables of Tp. Each anchor is marked
with the corresponding slot. For example, the
anchors {antibioticssubj←, miscarriage obj←} were ex-
tracted from the sentence “antibiotics in pregnancy
prevent miscarriage”.
STEP (3.B) tries to extend each candidate set with
one additional context anchor, in order to improve
its specificity. This anchor is chosen as the highest
tf·idf scoring phrase in the sentence, if it exists. In
the previous example, ‘pregnancy’ is selected.
STEP (4) filters out bad candidate anchor sets by
two different criteria. STEP (4.A) maintains only
candidates with absolute Web frequency within a
threshold range [MINSETF, MAXSETF], to guaran-
tee an appropriate specificity-generality level. STEP
(4.B) guarantees sufficient (directional) association
between the candidate anchor set c and Tp, by esti-
mating
Prob(Tp|c)≈ freqW(P ∩c)freq
W(c)
where freqW is Web frequency and P is the pivot.
We maintain only candidates for which this prob-
ability falls within a threshold range [SETMINP,
SETMAXP]. Higher probability often corresponds
to a strong linguistic collocation between the
candidate and Tp, without any semantic entail-
ment. Lower probability indicates coincidental co-
occurrence, without a consistent semantic relation.
The remaining candidates in S become the in-
put anchor-sets for the template extraction phase,
for example,{Aspirinsubj←, heart attackobj←}for ‘pre-
vent’.
3.2 Template Extraction (TE)
The Template Extraction algorithm accepts as its in-
put a list of anchor sets extracted from ASE for each
pivot template. Then, TE generates a set of syntactic
templates which are supposed to maintain an entail-
ment relationship with the initial pivot template. TE
performs three main steps, described in the follow-
ing subsections:
1. Acquisition of a sample corpus from the Web.
2. Extraction of maximal most general templates
from that corpus.
3. Post-processing and final ranking of extracted
templates.
3.2.1 Acquisition of a sample corpus from the
Web
For each input anchor set, TE acquires from the
Web a sample corpus of sentences containing it.
For example, a sentence from the sample corpus
for {aspirin, heart attack} is: “Aspirin stops heart
attack?”. All of the sample sentences are then
parsed with MINIPAR (Lin, 1998), which gener-
ates from each sentence a syntactic directed acyclic
graph (DAG) representing the dependency structure
of the sentence. Each vertex in this graph is labeled
with a word and some morphological information;
each graph edge is labeled with the syntactic rela-
tion between the words it connects.
TE then substitutes each slot anchor (see section
3.1) in the parse graphs with its corresponding slot
variable. Therefore, “Aspirin stops heart attack?”
will be transformed into ‘X stop Y’. This way all
the anchors for a certain slot are unified under the
same variable name in all sentences. The parsed
sentences related to all of the anchor sets are sub-
sequently merged into a single set of parse graphs
S ={P1,P2,...,Pn}(see P1 and P2 in Figure 2).
3.2.2 Extraction of maximal most general
templates
The core of TE is a General Structure Learning al-
gorithm (GSL) that is applied to the set of parse
graphs S resulting from the previous step. GSL
extracts single-rooted syntactic DAGs, which are
named spanning templates since they must span at
least over Na slot variables, and should also ap-
pear in at least Nr sentences from S (In our exper-
iments we set Na=2 and Nr=2). GSL learns maxi-
mal most general templates: they are spanning tem-
plates which, at the same time, (a) cannot be gener-
alized by further reduction and (b) cannot be further
extended keeping the same generality level.
In order to properly define the notion of maximal
most general templates, we introduce some formal
definitions and notations.
DEFINITION: For a spanning template t we define
a sentence set, denoted with σ(t), as the set of all
parsed sentences in S containing t.
For each pair of templates t1 and t2, we use the no-
tation t1 precedesequalt2 to denote that t1 is included as a sub-
graph or is equal to t2. We use the notation t1 ≺t2
when such inclusion holds strictly. We define T(S)
as the set of all spanning templates in the sample S.
DEFINITION: A spanning template t ∈ T(S) is
maximal most general if and only if both of the fol-
lowing conditions hold:
CONDITION A: For∀tprime∈T(S),tprimeprecedesequalt, it holds that
σ(t) = σ(tprime).
CONDITION B: For∀tprime∈T(S),t≺tprime, it holds that
σ(t)⊃σ(tprime).
Condition A ensures that the extracted templates do
not contain spanning sub-structures that are more
”general” (i.e. having a larger sentence set); con-
dition B ensures that the template cannot be further
enlarged without reducing its sentence set.
GSL performs template extraction in two main
steps: (1) build a compact graph representation of
all the parse graphs from S; (2) extract templates
from the compact representation.
A compact graph representation is an aggregate
graph which joins all the sentence graphs from S
ensuring that all identical spanning sub-structures
from different sentences are merged into a single
one. Therefore, each vertex v (respectively, edge
e) in the aggregate graph is either a copy of a cor-
responding vertex (edge) from a sentence graph Pi
or it represents the merging of several identically
labeled vertices (edges) from different sentences in
S. The set of such sentences is defined as the sen-
tence set of v (e), and is represented through the set
of index numbers of related sentences (e.g. “(1,2)”
in the third tree of Figure 2). We will denote with
Gi the compact graph representation of the first i
sentences in S. The parse trees P1 and P2 of two
sentences and their related compact representation
G2 are shown in Figure 2.
Building the compact graph representation
The compact graph representation is built incremen-
tally. The algorithm starts with an empty aggregate
graph G0 and then merges the sentence graphs from
S one at a time into the aggregate structure.
Let’s denote the current aggregate graph with
Gi−1(Vg,Eg) and let Pi(Vp,Ep) be the parse graph
which will be merged next. Note that the sentence
set of Pi is a single element set{i}.
During each iteration a new graph is created as
the union of both input graphs: Gi = Gi−1∪Pi.
Then, the following merging procedure is per-
formed on the elements of Gi
1. ADDING GENERALIZED VERTICES TO Gi.
For every two vertices vg ∈ Vg,vp ∈ Vp having
equal labels, a new generalized vertex vnewg is cre-
ated and added to Gi. The new vertex takes the same
label and holds a sentence set which is formed from
the sentence set of vg by adding i to it. Still with
reference to Figure 2, the generalized vertices in G2
are ‘X’, ‘Y’ and ‘stop’. The algorithm connects the
generalized vertex vnewg with all the vertices which
are connected with vg and vp.
2. MERGING EDGES. If two edges eg ∈Eg and
ep ∈ Ep have equal labels and their corresponding
adjacent vertices have been merged, then ea and ep
are also merged into a new edge. In Figure 2 the
edges (‘stop’, ‘X’) and (‘stop’, ‘Y’) from P1 and
P2 are eventually merged into G2.
3. DELETING MERGED VERTICES. Every vertex
v from Vp or Vg for which at least one generalized
vertex vnewg exists is deleted from Gi.
As an optimization step, we merge only vertices
and edges that are included in equal spanning tem-
plates.
Extracting the templates
GSL extracts all maximal most general templates
from the final compact representation Gn using the
following sub-algorithm:
1. BUILDING MINIMAL SPANNING TREES. For
every Na different slot variables in Gn having a
common ancestor, a minimal spanning tree st is
built. Its sentence set is computed as the intersec-
tion of the sentence sets of its edges and vertices.
2. EXPANDING THE SPANNING TREES. Every
minimal spanning tree st is expanded to the maxi-
mal sub-graph maxst whose sentence set is equal to
σ(st). All maximal single-rooted DAGs in maxst
are extracted as candidate templates. Maximality
ensures that the extracted templates cannot be ex-
panded further while keeping the same sentence set,
satisfying condition B.
3. FILTERING. Candidates which contain an-
other candidate with a larger sentence set are filtered
out. This step guarantees condition A.
In Figure 2 the maximal most general template in
G2 is ‘X subj← stop obj→Y’.
3.2.3 Post-processing and ranking of extracted
templates
As a last step, names and numbers are filtered out
from the templates. Moreover, TE removes those
templates which are very long or which appear with
just one anchor set and in less than four sentences.
Finally, the templates are sorted first by the number
of anchor sets with which each template appeared,
and then by the number of sentences in which they
appeared.
4 Evaluation
We evaluated the results of the TE/ASE algorithm
on a random lexicon of verbal forms and then as-
sessed its performance on the extracted data through
human-based judgments.
P1 : stop
subjd122
d122d122
d124d124d122d122d122d122
obj
d65d65d65
d32d32d65d65d65d65
P2 : stop
subjd122
d122d122
d124d124d122d122d122d122
obj
d15d15
byd74
d74d74d74
d37d37d74d74d74d74
G2 : stop(1,2)
subj(1,2)d114d114
d114d114
d120d120d114d114d114d114 obj(1,2) d15d15 by(2)
d79d79d79d79
d39d39d79d79d79d79
X Y X Y absorbing X(1,2) Y(1,2) absorbing(2)
Figure 2: Two parse trees and their compact representation (sentence sets are shown in parentheses).
4.1 Experimental Setting
The test set for human evaluation was generated by
picking out 53 random verbs from the 1000 most
frequent ones found in a subset of the Reuters cor-
pus2. For each verb entry in the lexicon, we pro-
vided the judges with the corresponding pivot tem-
plate and the list of related candidate entailment
templates found by the system. The judges were
asked to evaluate entailment for a total of 752 tem-
plates, extracted for 53 pivot lexicon entries; Table
1 shows a sample of the evaluated templates; all of
them are clearly good and were judged as correct
ones.
Pivot Template Entailment Templates
X prevent Y X provides protection against Y
X reduces Y
X decreases the risk of Y
X be cure for Y
X a day keeps Y away
X to combat Y
X accuse Y X call Y indictable
X testifies against Y
Y defense before X
X acquire Y X snap up Y
Y shareholders approve X
buyout
Y shareholders receive shares
of X stock
X go back to Y Y allowed X to return
Table 1: Sample of templates found by TE/ASE and
included in the evaluation test set.
Concerning the ASE algorithm, threshold pa-
rameters3 were set as PHRASEMAXF=107, SET-
MINF=102, SETMAXF=105, SETMINP=0.066,
and SETMAXP=0.666. An upper limit of 30 was
imposed on the number of possible anchor sets used
for each pivot. Since this last value turned out to
be very conservative with respect to system cover-
2Known as Reuters Corpus, Volume 1, English Language,
1996-08-20 to 1997-08-19.
3All parameters were tuned on a disjoint development lexi-
con before the actual experiment.
age, we subsequently attempted to relax it to 50 (see
Discussion in Section 4.3).
Further post-processing was necessary over ex-
tracted data in order to remove syntactic variations
referring to the same candidate template (typically
passive/active variations).
Three possible judgment categories have been
considered: Correct if an entailment relationship
in at least one direction holds between the judged
template and the pivot template in some non-bizarre
context; Incorrect if there is no reasonable context
and variable instantiation in which entailment holds;
No Evaluation if the judge cannot come to a definite
conclusion.
4.2 Results
Each of the three assessors (referred to as J#1, J#2,
and J#3) issued judgments for the 752 different
templates. Correct templates resulted to be 283,
313, and 295 with respect to the three judges. No
evaluation’s were 2, 0, and 16, while the remaining
templates were judged Incorrect.
For each verb, we calculate Yield as the absolute
number of Correct templates found and Precision as
the percentage of good templates out of all extracted
templates. Obtained Precision is 44.15%, averaged
over the 53 verbs and the 3 judges. Considering Low
Majority on judges, the precision value is 42.39%.
Average Yield was 5.5 templates per verb.
These figures may be compared (informally, as
data is incomparable) with average yield of 10.1
and average precision of 50.3% for the 9 “pivot”
templates of (Lin and Pantel, 2001). The compar-
ison suggests that it is possible to obtain from the
(very noisy) web a similar range of precision as was
obtained from a clean news corpus. It also indi-
cates that there is potential for acquiring additional
templates per pivot, which would require further re-
search on broadening efficiently the search for addi-
tional web data per pivot.
Agreement among judges is measured by the
Kappa value, which is 0.55 between J#1 and J#2,
0.57 between J#2 and J#3, and 0.63 between J#1
and J#3. Such Kappa values correspond to moder-
ate agreement for the first two pairs and substantial
agreement for the third one. In general, unanimous
agreement among all of the three judges has been
reported on 519 out of 752 templates, which corre-
sponds to 69%.
4.3 Discussion
Our algorithm obtained encouraging results, ex-
tracting a considerable amount of interesting tem-
plates and showing inherent capability of discover-
ing complex semantic relations.
Concerning overall coverage, we managed to find
correct templates for 86% of the verbs (46 out of
53). Nonetheless, presented results show a substan-
tial margin of possible improvement. In fact yield
values (5.5 Low Majority, up to 24 in best cases),
which are our first concern, are inherently depen-
dent on the breadth of Web search performed by
the ASE algorithm. Due to computational time, the
maximal number of anchor sets processed for each
verb was held back to 30, significantly reducing the
amount of retrieved data.
In order to further investigate ASE potential, we
subsequently performed some extended experiment
trials raising the number of anchor sets per pivot
to 50. This time we randomly chose a subset of
10 verbs out of the less frequent ones in the origi-
nal main experiment. Results for these verbs in the
main experiment were an average Yield of 3 and an
average Precision of 45.19%. In contrast, the ex-
tended experiments on these verbs achieved a 6.5
Yield and 59.95% Precision (average values). These
results are indeed promising, and the substantial
growth in Yield clearly indicates that the TE/ASE
algorithms can be further improved. We thus sug-
gest that the feasibility of our approach displays the
inherent scalability of the TE/ASE process, and its
potential to acquire a large entailment relation KB
using a full scale lexicon.
A further improvement direction relates to tem-
plate ranking and filtering. While in this paper
we considered anchor sets to have equal weights,
we are also carrying out experiments with weights
based on cross-correlation between anchor sets.
5 Conclusions
We have described a scalable Web-based approach
for entailment relation acquisition which requires
only a standard phrasal lexicon as input. This min-
imal level of input is much simpler than required
by earlier web-based approaches, while succeeding
to maintain good performance. This result shows
that it is possible to identify useful anchor sets in
a fully unsupervised manner. The acquired tem-
plates demonstrate a broad range of semantic rela-
tions varying from synonymy to more complicated
entailment. These templates go beyond trivial para-
phrases, demonstrating the generality and viability
of the presented approach.
From our current experiments we can expect to
learn about 5 relations per lexicon entry, at least for
the more frequent entries. Moreover, looking at the
extended test, we can extrapolate a notably larger
yield by broadening the search space. Together with
the fact that we expect to find entailment relations
for about 85% of a lexicon, it is a significant step
towards scalability, indicating that we will be able
to extract a large scale KB for a large scale lexicon.
In future work we aim to improve the yield by in-
creasing the size of the sample-corpus in a qualita-
tive way, as well as precision, using statistical meth-
ods such as supervised learning for better anchor set
identification and cross-correlation between differ-
ent pivots. We also plan to support noun phrases
as input, in addition to verb phrases. Finally, we
would like to extend the learning task to discover the
correct entailment direction between acquired tem-
plates, completing the knowledge required by prac-
tical applications.
Like (Lin and Pantel, 2001), learning the context
for which entailment relations are valid is beyond
the scope of this paper. As stated, we learn entail-
ment relations holding for some, but not necessarily
all, contexts. In future work we also plan to find the
valid contexts for entailment relations.
Acknowledgements
The authors would like to thank Oren Glickman
(Bar Ilan University) for helpful discussions and as-
sistance in the evaluation, Bernardo Magnini for his
scientific supervision at ITC-irst, Alessandro Vallin
and Danilo Giampiccolo (ITC-irst) for their help in
developing the human based evaluation, and Prof.
Yossi Matias (Tel-Aviv University) for supervising
the first author. This work was partially supported
by the MOREWEB project, financed by Provincia
Autonoma di Trento. It was also partly carried out
within the framework of the ITC-IRST (TRENTO,
ITALY) – UNIVERSITY OF HAIFA (ISRAEL) col-
laboration project. For data visualization and analy-
sis the authors intensively used the CLARK system
(www.bultreebank.org) developed at the Bulgarian
Academy of Sciences .

References
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach us- ing multiple-sequence alignment. In Proceedings of HLT-NAACL 2003, pages 16–23, Edmonton, Canada.
Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of ACL 2001, pages 50–57, Toulose, France.
Ido Dagan and Oren Glickman. 2004. Probabilis- tic textual entailment: Generic applied modeling of language variability. In PASCAL Workshop on Learning Methods for Text Understanding and Mining, Grenoble.
Florence Duclaye, Franc¸ois Yvon, and Olivier Collin. 2002. Using the Web as a linguistic re- source for learning reformulations automatically. In Proceedings of LREC 2002, pages 390–396, Las Palmas, Spain.
Oren Glickman and Ido Dagan. 2003. Identifying lexical paraphrases from a single corpus: a case study for verbs. In Proceedings of RANLP 2003.
Ulf Hermjakob, Abdessamad Echihabi, and Daniel Marcu. 2003. Natural language based reformula- tion resource and Web Exploitation. In Ellen M. Voorhees and Lori P. Buckland, editors, Proceed- ings of the 11th Text Retrieval Conference (TREC 2002), Gaithersburg, MD. NIST.
Christian Jacquemin. 1999. Syntagmatic and paradigmatic representations of term variation. In Proceedings of ACL 1999, pages 341–348.
Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for Question Answering. Natural Language Engineering, 7(4):343–360.
Dekang Lin. 1998. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC 1998, Granada, Spain.
Dan Moldovan and Vasile Rus. 2001. Logic form transformation of WordNet and its applicability to Question Answering. In Proceedings of ACL 2001, pages 394–401, Toulose, France.
Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sen- tences. In Proceedings of HLT-NAACL 2003, Ed- monton, Canada.
Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a Question An- swering system. In Proceedings of ACL 2002, Philadelphia, PA.
Ellen Riloff and Rosie Jones. 1999. Learning dic- tionaries for Information Extraction by multi- level bootstrapping. In Proceedings of the Six- teenth National Conference on Artificial Intelli- gence (AAAI-99), pages 474–479.
Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman. 2002. Automatic para- phrase acquisition from news articles. In Pro- ceedings of Human Language Technology Con- ference (HLT 2002), San Diego, USA.
Kiyoshi Sudo, Satoshi Sekine, and Ralph Grishman. 2003. An improved extraction pattern represen- tation model for automatic IE pattern acquisition. In Proceedings of ACL 2003.
Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. 2000. Un- supervised discovery of scenario-level patterns for Information Extraction. In Proceedings of COLING 2000.
