Workshop on TextGraphs, at HLT-NAACL 2006, pages 33–36,
New York City, June 2006. ©2006 Association for Computational Linguistics
Similarity between Pairs of Co-indexed Trees
for Textual Entailment Recognition
Fabio Massimo Zanzotto
DISCo
University Of Milan-Bicocca
Milano, Italy
zanzotto@disco.unimib.it
Alessandro Moschitti
DISP
University Of Rome “Tor Vergata”
Roma, Italy
moschitti@info.uniroma2.it
Abstract
In this paper we present a novel similarity between pairs of
co-indexed trees for automatically learning textual entailment
classifiers. We define a kernel function based on this similarity
along with a more classical intra-pair similarity. Experiments
show an improvement of 4.4 absolute percentage points over
state-of-the-art methods.
1 Introduction
Recently, considerable interest has been devoted to textual
entailment recognition (Dagan et al., 2005). The task requires
determining whether or not a text T entails a hypothesis H. As it
is a binary classification task, it might seem simple to use
machine learning algorithms to learn an entailment classifier
from training examples. Unfortunately, it is not. The learner
should capture the similarities between different pairs, (T′,H′)
and (T″,H″), taking into account the relations between the
sentences within a pair. For example, given these two learning
pairs:
T1 ⇒ H1
T1 “At the end of the year, all solid compa-
nies pay dividends”
H1 “At the end of the year, all solid
insurance companies pay dividends.”
T1 ⇏ H2
T1 “At the end of the year, all solid compa-
nies pay dividends”
H2 “At the end of the year, all solid compa-
nies pay cash dividends.”
determining whether or not the following implica-
tion holds:
T3 ⇒ H3?
T3 “All wild animals eat plants that have
scientifically proven medicinal proper-
ties.”
H3 “All wild mountain animals eat plants
that have scientifically proven medici-
nal properties.”
requires detecting that:
1. T3 is structurally (and somewhat lexically) similar to T1 and
H3 is more similar to H1 than to H2;
2. the relations between the sentences in the pair (T3,H3) (e.g.,
T3 and H3 have the same noun governing the subject of the main
sentence) are similar to the relations between the sentences in
the pairs (T1,H1) and (T1,H2).
Given this analysis we may derive that T3 ⇒ H3.
The example suggests that graph matching techniques are not
sufficient, as these may only detect the structural similarity
between the sentences of textual entailment pairs. An extension
is needed that also considers whether two pairs show compatible
relations between their sentences.
In this paper, we propose to view textual entailment pairs as
pairs of syntactic trees with co-indexed nodes. This should help
to consider both the structural similarity between syntactic tree
pairs and the similarity between relations among sentences within
a pair. Then, we use this cross-pair similarity together with more
traditional intra-pair similarities (e.g., (Corley and Mihalcea,
2005)) to define a novel kernel function. We experimented with
this kernel using Support Vector Machines on the Recognizing
Textual Entailment (RTE) challenge test-beds. The comparative
results show that (a) we have designed an effective way to
automatically learn entailment rules
from examples and (b) our approach is highly accu-
rate and exceeds the accuracy of the current state-of-
the-art models.
In the remainder of this paper, Sec. 2 introduces
the cross-pair similarity and Sec. 3 shows the exper-
imental results.
2 Learning Textual Entailment from
examples
To carry out automatic learning from examples, we need to define
a cross-pair similarity K((T′,H′),(T″,H″)). This function should
consider pairs similar when: (1) texts and hypotheses are
structurally and lexically similar (structural similarity); (2)
the relations between the sentences in the pair (T′,H′) are
compatible with the relations in (T″,H″) (intra-pair word movement
compatibility). We argue that such requirements could be met by
augmenting syntactic trees with placeholders that co-index related
words within pairs. We will then define a cross-pair similarity
over these pairs of co-indexed trees.
2.1 Training examples as pairs of co-indexed
trees
Sentence pairs selected as candidate entailment pairs are
naturally co-indexed. Many words (or expressions) w_h in H have a
referent w_t in T. These pairs (w_t,w_h) are called anchors.
Arguably, the fact that the two words in an anchor are related
matters more than the actual words themselves: the entailment
could hold even if the two words were substituted with two other
related words. To indicate this, we co-index words by associating
placeholders with anchors. For example, in Fig. 1, 2″ indicates
the (companies,companies) anchor between T1 and H1. These
placeholders are then used to augment tree nodes. To better take
into account argument movements, placeholders are propagated in
the syntactic trees following constituent heads (see Fig. 1).
In line with much other research (e.g., (Corley and Mihalcea,
2005)), we determine these anchors using different similarity or
relatedness detectors: the exact matching between tokens or
lemmas, a similarity between tokens based on their edit distance,
the derivationally related form relation and the verb entailment
relation in WordNet, and, finally, a WordNet-based similarity
(Jiang and Conrath, 1997). Each of these detectors gives a
different weight to the anchor: the actual computed similarity for
the last one and 1 for all the others. These weights will be used
in the final kernel.
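The surface-level detectors can be illustrated with a minimal Python sketch. It covers only exact matching and edit-distance similarity; the WordNet-based detectors are left out. The function names, the normalisation of the edit distance, the 0.8 threshold and the first-match policy are all assumptions made for illustration, not details given in this paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


def find_anchors(text_tokens, hyp_tokens, threshold=0.8):
    """Return (w_t, w_h, weight) triples.

    Each hypothesis word is anchored to the first text word that matches
    exactly or is close enough in edit distance (threshold is hypothetical).
    Both detectors assign weight 1, as stated for non-WordNet-similarity
    detectors.
    """
    anchors = []
    for wh in hyp_tokens:
        for wt in text_tokens:
            same = wt.lower() == wh.lower()
            if same or edit_similarity(wt.lower(), wh.lower()) >= threshold:
                anchors.append((wt, wh, 1.0))
                break
    return anchors


t1 = "all solid companies pay dividends".split()
h2 = "all solid companies pay cash dividends".split()
anchors = find_anchors(t1, h2)
```

In this toy run every word of H2 except "cash" receives an anchor in T1, which mirrors the observation that the unanchored modifier is what distinguishes the hypothesis from the text.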
2.2 Similarity between pairs of co-indexed
trees
Pairs of syntactic trees whose nodes are co-indexed with
placeholders allow the design of a cross-pair similarity that
considers both the structural similarity and the intra-pair word
movement compatibility.
Syntactic trees of texts and hypotheses make it possible to verify
the structural similarity between pairs of sentences. Texts
should have similar structures, and so should hypotheses. In
Fig. 1, the overlapping subtrees are in bold. For example, T1 and
T3 share the subtree starting with S → NP VP. Although the lexical
items in T3 and H3 are quite different from those in T1 and H1,
their bold subtrees are more similar to those of T1 and H1 than to
those of T1 and H2, respectively. H1 and H3 share the production
NP → DT JJ NN NNS while H2 and H3 do not. To decide on the
entailment for (T3,H3), we can therefore use the value of (T1,H1).
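The production-level comparison can be made concrete with a small Python sketch. It encodes the subject noun phrases of H1, H2 and H3 as they appear in Fig. 1 (placeholders omitted) and lists the productions each pair shares. The nested-tuple encoding and the function name are assumptions chosen for illustration.

```python
def productions(tree):
    """Collect non-lexical productions of a tree.

    A tree is a tuple (label, child, ...); a pre-terminal is (tag, word).
    """
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return set()  # pre-terminal node: lexical productions are skipped
    rhs = " ".join(child[0] for child in children)
    found = {f"{label} -> {rhs}"}
    for child in children:
        found |= productions(child)
    return found


h1_np = ("NP", ("DT", "all"), ("JJ", "solid"),
         ("NN", "insurance"), ("NNS", "companies"))
h2_np = ("NP", ("DT", "all"), ("JJ", "solid"), ("NNS", "companies"))
h3_np = ("NP", ("DT", "All"), ("JJ", "wild"),
         ("NN", "mountain"), ("NNS", "animals"))

shared_h1_h3 = productions(h1_np) & productions(h3_np)
shared_h2_h3 = productions(h2_np) & productions(h3_np)
```

Here `shared_h1_h3` contains NP -> DT JJ NN NNS while `shared_h2_h3` is empty, even though H1 and H3 have almost no words in common.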
Anchors and placeholders are useful to verify whether two pairs
can be aligned as showing compatible intra-pair word movement.
For example, (T1,H1) and (T3,H3) show compatible constituent
movements given that the dashed lines connecting placeholders of
the two pairs indicate structurally equivalent nodes both in the
texts and in the hypotheses. The dashed line between 3 and b links
the main verbs both in the texts T1 and T3 and in the hypotheses
H1 and H3. After substituting 3 for b and 2 for a, T1 and T3 share
the subtree S → NP 2 VP 3. The same subtree is shared between H1
and H3. This implies that words in the pair (T1,H1) are correlated
like words in (T3,H3). Any different mapping between the two
anchor sets would not have this property.
Figure 1: Relations between (T1,H1), (T1,H2), and (T3,H3). [Syntactic trees with co-indexed placeholders, shown in three panels: T1 and T3, H1 and H3, H2 and H3.]

Using the structural similarity, the placeholders, and the
connection between placeholders, the overall similarity is then
defined as follows. Let A′ and A″ be the placeholders of (T′,H′)
and (T″,H″), respectively. The similarity between two co-indexed
syntactic tree pairs K_s((T′,H′),(T″,H″)) is defined using a
classical similarity between two trees K_T(t1,t2) when the best
alignment between A′ and A″ is given. Let C be the set of all
bijective mappings from a′ ⊆ A′ : |a′| = |A″| to A″; an element
c ∈ C is a substitution function. The co-indexed tree pair
similarity is then defined as:

$$K_s((T',H'),(T'',H'')) = \max_{c \in C} \bigl( K_T(t(H',c), t(H'',i)) + K_T(t(T',c), t(T'',i)) \bigr)$$

where (1) t(S,c) returns the syntactic tree of the hypothesis
(text) S with placeholders replaced by means of the substitution
c, (2) i is the identity substitution and (3) K_T(t1,t2) is a
function that measures the similarity between the two trees t1 and
t2.
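The maximisation over substitutions can be sketched in Python as follows. This is only an illustration under stated assumptions: trees are reduced to lists of productions, K_T is replaced by a simple count of shared productions rather than a real tree kernel, the tree fragments are simplified versions of those in Fig. 1 (the leading PP of T1 and H1 is dropped), and the score defaults to 0 when A′ has fewer placeholders than A″. All helper names are hypothetical.

```python
from itertools import permutations


def sym(s):
    """Parse 'NP:2' into ('NP', '2'); a bare 'DT' becomes ('DT', None)."""
    label, _, placeholder = s.partition(":")
    return (label, placeholder or None)


def prod(*symbols):
    """A production: the first symbol is the left-hand side."""
    return tuple(sym(s) for s in symbols)


def substitute(tree, c):
    """t(S, c): rename placeholders according to the substitution c."""
    return [tuple((label, c.get(ph, ph)) for label, ph in p) for p in tree]


def k_tree(t1, t2):
    """Stand-in for K_T (assumption): number of shared productions."""
    return len(set(t1) & set(t2))


def k_s(pair1, pair2, placeholders1, placeholders2):
    """Maximise over all bijections from subsets of A' onto A''."""
    (text1, hyp1), (text2, hyp2) = pair1, pair2
    best_score, best_map = -1, None
    for chosen in permutations(placeholders1, len(placeholders2)):
        c = dict(zip(chosen, placeholders2))
        score = (k_tree(substitute(hyp1, c), hyp2)
                 + k_tree(substitute(text1, c), text2))
        if score > best_score:
            best_score, best_map = score, c
    if best_map is None:  # |A'| < |A''|: no mapping exists (assumed fallback)
        return 0, {}
    return best_score, best_map


t1 = [prod("S", "NP:2", "VP:3"), prod("NP:2", "DT", "JJ:2", "NNS:2"),
      prod("VP:3", "VBP:3", "NP:4"), prod("NP:4", "NNS:4")]
h1 = [prod("S", "NP:2", "VP:3"), prod("NP:2", "DT", "JJ:2", "NN", "NNS:2"),
      prod("VP:3", "VBP:3", "NP:4"), prod("NP:4", "NNS:4")]
t3 = [prod("S", "NP:a", "VP:b"), prod("NP:a", "DT", "JJ:a", "NNS:a"),
      prod("VP:b", "VBP:b", "NP:c")]
h3 = [prod("S", "NP:a", "VP:b"), prod("NP:a", "DT", "JJ:a", "NN", "NNS:a"),
      prod("VP:b", "VBP:b", "NP:c")]

score, mapping = k_s((t1, h1), (t3, h3), ["2", "3", "4"], ["a", "b", "c"])
```

With these fragments the best substitution aligns the subjects, the main verbs and the objects of the two pairs; any other assignment shares fewer productions. The enumeration is factorial in the number of placeholders, which is why reducing that number matters in practice.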
2.3 Enhancing cross-pair syntactic similarity
As the computational cost of the similarity measure depends on
the number of possible sets of correspondences C, and this in turn
depends on the size of the anchor sets, we reduce the number of
placeholders used to represent the anchors. Placeholders will have
the same name if they are in the same chunk both in the text and
in the hypothesis, e.g., the placeholders 2′ and 2″ are collapsed
to 2.
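A minimal Python sketch of this collapsing step, and of its effect on the number of candidate mappings, is given below. The chunk labels, the convention of stripping prime marks to obtain the shared name, and the assumption that the other pair has the same number of placeholders are all hypothetical choices made for illustration.

```python
from math import perm


def collapse_placeholders(anchors):
    """Give one name to placeholders lying in the same text and hypothesis chunk.

    anchors: list of (placeholder, text_chunk, hyp_chunk) triples.
    Returns a dict mapping each original placeholder to its collapsed name.
    """
    names = {}
    renamed = {}
    for placeholder, text_chunk, hyp_chunk in anchors:
        key = (text_chunk, hyp_chunk)
        # The first placeholder seen in a chunk pair fixes the shared name.
        names.setdefault(key, placeholder.rstrip("'"))
        renamed[placeholder] = names[key]
    return renamed


# Anchors of (T1,H1) restricted to the main clause; chunk labels are made up.
anchors_t1_h1 = [
    ("2'", "NP-subj", "NP-subj"),   # solid
    ("2''", "NP-subj", "NP-subj"),  # companies
    ("3", "VP", "VP"),              # pay
    ("4", "NP-obj", "NP-obj"),      # dividends
]
renamed = collapse_placeholders(anchors_t1_h1)

# Number of bijective mappings against a pair with as many placeholders.
n_before = len(renamed)
n_after = len(set(renamed.values()))
mappings_before = perm(n_before, n_before)
mappings_after = perm(n_after, n_after)
```

In this toy case the candidate mappings drop from 24 to 6.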
3 Experimental investigation
The aim of the experiments is twofold: we show that (a)
entailments can be learned from examples and (b) our kernel
function over syntactic structures is effective at deriving
syntactic properties. These goals can be achieved by comparing our
cross-pair similarity kernel against (and in combination with)
other methods.
3.1 Experimented kernels
Datasets                               K_l        K_l + K_t  K_l + K_s
Train: D1        Test: T1              0.5888     0.6213     0.6300
Train: T1        Test: D1              0.5644     0.5732     0.5838
Train: D2(50%)′  Test: D2(50%)″        0.6083     0.6156     0.6350
Train: D2(50%)″  Test: D2(50%)′        0.6272     0.5861     0.6607
Train: D2        Test: T2              0.6038     0.6238     0.6388
Mean                                   0.5985     0.6040     0.6297
                                       (±0.0235)  (±0.0229)  (±0.0282)
Table 1: Experimental results

We compared three different kernels: (1) the kernel
K_l((T′,H′),(T″,H″)) based on the intra-pair lexical similarity
sim_l(T,H) as defined in (Corley and Mihalcea, 2005), which is
defined as K_l((T′,H′),(T″,H″)) = sim_l(T′,H′) × sim_l(T″,H″);
(2) the kernel K_l + K_s that combines our kernel with the
lexical-similarity-based kernel; (3) the kernel K_l + K_t that
combines the lexical-similarity-based kernel with a basic tree
kernel. The latter is defined as K_t((T′,H′),(T″,H″)) =
K_T(T′,T″) + K_T(H′,H″). We implemented these kernels within
SVM-light (Joachims, 1999).
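How the compared kernels are composed can be sketched in a few lines of Python. The lexical similarities are passed in as precomputed numbers, and the tree similarity is a toy production-overlap count standing in for the real tree kernel; both simplifications, as well as the function names and the sample values, are assumptions made for illustration.

```python
def k_l(sim1: float, sim2: float) -> float:
    """Lexical kernel: product of the two intra-pair similarities."""
    return sim1 * sim2


def k_t(pair1, pair2, k_tree) -> float:
    """Basic tree kernel on pairs: text vs. text plus hypothesis vs. hypothesis."""
    (text1, hyp1), (text2, hyp2) = pair1, pair2
    return k_tree(text1, text2) + k_tree(hyp1, hyp2)


def toy_k_tree(t1, t2) -> float:
    """Stand-in for K_T (assumption): number of shared productions."""
    return float(len(set(t1) & set(t2)))


# Each sentence is reduced to a list of productions; values are made up.
pair_a = (["S -> NP VP", "NP -> DT JJ NNS"],
          ["S -> NP VP", "NP -> DT JJ NN NNS"])
pair_b = (["S -> NP VP", "NP -> DT NNS"],
          ["S -> NP VP", "NP -> DT JJ NN NNS"])
sim_a, sim_b = 0.9, 0.8  # hypothetical sim_l values for the two pairs

lexical = k_l(sim_a, sim_b)
structural = k_t(pair_a, pair_b, toy_k_tree)
combined = lexical + structural  # the K_l + K_t combination
```

The K_l + K_s combination has the same shape: the structural term is swapped for the similarity over co-indexed trees.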
3.2 Experimental settings
For the experiments, we used the Recognizing Textual Entailment
(RTE) Challenge data sets. We name these D1, T1 and D2, T2: they
are the development and the test sets of the first and second RTE
challenges, respectively. D1 contains 567 examples whereas T1, D2
and T2 all have the same size, i.e. 800 instances. Positive
examples make up 50% of the data. We also produced a random split
of D2. The two folds are D2(50%)′ and D2(50%)″.
We also used the following resources: the Charniak parser
(Charniak, 2000) to carry out the syntactic analysis; the
wn::similarity package (Pedersen et al., 2004) to compute the
Jiang&Conrath (J&C) distance (Jiang and Conrath, 1997) needed to
implement the lexical similarity sim_l(T,H) as defined in (Corley
and Mihalcea, 2005); SVM-light-TK (Moschitti, 2004) to encode the
basic tree kernel function, K_T, in SVM-light (Joachims, 1999).
3.3 Results and analysis
Table 1 reports the accuracy of the different similarity kernels
on the different training and test splits described in the
previous section. The table shows several important results.
First, as observed in (Corley and Mihalcea, 2005), the
lexical-based distance kernel K_l shows an accuracy significantly
higher than the random baseline, i.e. 50%. This accuracy (on the
Train:D1, Test:T1 split) is comparable with that of the best
systems in the first RTE challenge (Dagan et al., 2005). The
accuracy reported for the best systems, i.e. 58.6% (Glickman et
al., 2005; Bayer et al., 2005), is not significantly different
from the result obtained with K_l, i.e. 58.88%.
Second, our approach (last column) is better than all the other
methods, as it provides the best result for each combination of
training and test sets. On the “Train:D1-Test:T1” test-bed, it
exceeds the accuracy of the current state-of-the-art models
(Glickman et al., 2005; Bayer et al., 2005) by about 4.4 absolute
percentage points (63% vs. 58.6%) and that of our best lexical
similarity measure by about 4 points. Comparing the averages over
all datasets, our system improves on K_l by about 3.1 absolute
percentage points and on K_l + K_t by about 2.6.
Finally, the accuracy produced by our kernel based on co-indexed
trees, K_l + K_s, is higher than the one obtained with the plain
syntactic tree kernel K_l + K_t. Thus, the use of placeholders and
co-indexing is fundamental to automatically learning entailments
from examples.
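The summary rows of Table 1 can be recomputed from the five per-split accuracies with a short Python check. The figures are copied from the table; reading the ± values as sample standard deviations is an assumption, although it reproduces the reported numbers.

```python
from statistics import mean, stdev

# Accuracies from Table 1, one value per train/test split.
results = {
    "Kl":    [0.5888, 0.5644, 0.6083, 0.6272, 0.6038],
    "Kl+Kt": [0.6213, 0.5732, 0.6156, 0.5861, 0.6238],
    "Kl+Ks": [0.6300, 0.5838, 0.6350, 0.6607, 0.6388],
}

# Mean and sample standard deviation per kernel.
summary = {name: (mean(acc), stdev(acc)) for name, acc in results.items()}

# Average gain of the co-indexed kernel over the two alternatives.
gain_over_kl = summary["Kl+Ks"][0] - summary["Kl"][0]
gain_over_kl_kt = summary["Kl+Ks"][0] - summary["Kl+Kt"][0]

# Kl+Ks is the best system on every split.
best_everywhere = all(
    ks > max(kl, kt)
    for kl, kt, ks in zip(results["Kl"], results["Kl+Kt"], results["Kl+Ks"])
)
```

The recomputed means agree with the table, and the average gain of K_l + K_s comes out at roughly 3.1 points over K_l and 2.6 points over K_l + K_t.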

References
Samuel Bayer, John Burger, Lisa Ferro, John Henderson, and
Alexander Yeh. 2005. MITRE’s submissions to the EU PASCAL RTE
challenge. In Proceedings of the 1st PASCAL Challenge Workshop,
Southampton, UK.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In
Proc. of the 1st NAACL, pages 132–139, Seattle, Washington.
Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic
similarity of texts. In Proc. of the ACL Workshop on Empirical
Modeling of Semantic Equivalence and Entailment, pages 13–18, Ann
Arbor, Michigan, June. Association for Computational Linguistics.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL
RTE challenge. In PASCAL Challenges Workshop, Southampton, UK.
Oren Glickman, Ido Dagan, and Moshe Koppel. 2005. Web based
probabilistic textual entailment. In Proceedings of the 1st
PASCAL Challenge Workshop, Southampton, UK.
Jay J. Jiang and David W. Conrath. 1997. Semantic similarity
based on corpus statistics and lexical taxonomy. In Proc. of the
10th ROCLING, pages 132–139, Taipei, Taiwan.
Thorsten Joachims. 1999. Making large-scale SVM learning
practical. In B. Schölkopf, C. Burges, and A. Smola, editors,
Advances in Kernel Methods - Support Vector Learning. MIT Press.
Alessandro Moschitti. 2004. A study on convolution kernels for
shallow semantic parsing. In Proceedings of the ACL, Barcelona,
Spain.
Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004.
WordNet::Similarity - measuring the relatedness of concepts. In
Proc. of 5th NAACL, Boston, MA.
