Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 819–826,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Logic-based Semantic Approach to Recognizing Textual Entailment
Marta Tatu and Dan Moldovan
Language Computer Corporation
Richardson, Texas, 75080
United States of America
marta,moldovan@languagecomputer.com
Abstract
This paper proposes a knowledge repre-
sentation model and a logic proving set-
ting with axioms on demand success-
fully used for recognizing textual entail-
ments. It also details a lexical inference
system which boosts the performance of
the deep semantic oriented approach on
the RTE data. The linear combination of
two slightly different logical systems with
the third lexical inference system achieves
73.75% accuracy on the RTE 2006 data.
1 Introduction
While communicating, humans use different ex-
pressions to convey the same meaning. One of
the central challenges for natural language under-
standing systems is to determine whether different
text fragments have the same meaning or, more
generally, if the meaning of one text can be de-
rived from the meaning of another. A module
that recognizes the semantic entailment between
two text snippets can be employed by many NLP
applications. For example, Question Answering
systems have to identify texts that entail expected
answers. In Multi-document Summarization, the
redundant information should be recognized and
omitted from the summary.
Trying to boost research in textual inferences,
the PASCAL Network proposed the Recognizing
Textual Entailment (RTE) challenges (Dagan et al.,
2005; Bar-Haim et al., 2006). For a pair of two text
fragments, the task is to determine if the meaning
of one text (the entailed hypothesis denoted by a0 )
can be inferred from the meaning of the other text
(the entailing text or a1 ).
In this paper, we propose a model to represent
the knowledge encoded in text and a logical set-
ting suitable to a recognizing semantic entailment
system. We cast the textual inference problem as
a logic implication between meanings. Text a1 se-
mantically entails a0 if its meaning logically im-
plies the meaning of a0 . Thus, we, first, transform
both text fragments into logic form, capture their
meaning by detecting the semantic relations that
hold between their constituents and load these rich
logic representations into a natural language logic
prover to decide if the entailment holds or not.
Figure 1 illustrates our approach to RTE. The fol-
lowing sections of the paper shall detail the logic
proving methodology, our logical representation
of text and the various types of axioms that the
prover uses.
To our knowledge, there are few logical ap-
proaches to RTE. (Bos and Markert, 2005) rep-
resents a1 and a0 into a first-order logic trans-
lation of the DRS language used in Discourse
Representation Theory (Kamp and Reyle, 1993)
and uses a theorem prover and a model builder
with some generic, lexical and geographical back-
ground knowledge to prove the entailment be-
tween the two texts. (de Salvo Braz et al., 2005)
proposes a Description Logic-based knowledge
representation language used to induce the repre-
sentations of a1 and a0 and uses an extended sub-
sumption algorithm to check if any of a1 ’s rep-
resentations obtained through equivalent transfor-
mations entails a0 .
2 Cogex - A Logic Prover for NLP
Our system uses COGEX (Moldovan et al., 2003),
a natural language prover originating from OT-
TER (McCune, 1994). Once its set of support is
loaded with a1 and the negated hypothesis (a2 a0 )
and its usable list with the axioms needed to gener-
819
Figure 1: COGEX’s Architecture
ate inferences, COGEX begins to search for proofs.
To every inference, an appropriate weight is as-
signed depending on the axiom used for its deriva-
tion. If a refutation is found, the proof is complete;
if a refutation cannot be found, then predicate ar-
guments are relaxed. When argument relaxation
fails to produce a refutation, entire predicates are
dropped from the negated hypothesis until a refu-
tation is found.
2.1 Proof scoring algorithm
Once a proof by contradiction is found, its score is
computed by starting with an initial perfect score
and deducting points for each axiom utilized in the
proof, every relaxed argument, and dropped predi-
cate. The computed score is a measure of the kinds
of axioms used in the proof and the significance of
the dropped arguments and predicates. If we as-
sume that both text fragments are existential, then
a1a1a0
a0 if and only if
a1 ’s entities are a subset of
a0 ’s entities (Some smart people read
a0 Some peo-
ple read) and penalizing a pair whose a0 contains
predicates that cannot be inferred is a correct way
to ensure entailment (Some people read a2a0 Some
smart people read). But, if both a1 and a0 are uni-
versally quantified, then the groups mentioned in
a0 must be a subset of the ones from
a1 (All people
read a0 All smart people read and All smart people
read a2a0 All people read). Thus, the scoring mod-
ule adds back the points for the modifiers dropped
from a0 and subtracts points for a1 ’s modifiers not
present in a0 . The remaining two cases are sum-
marized in Table 1.
Because a3 a1a5a4 a0a7a6 pairs with longer sentences can
potentially drop more predicates and receive a
lower score, COGEX normalizes the proof scores
by dividing the assessed penalty by the maximum
assessable penalty (all the predicates from a0 are
dropped). If this final proof score is above a
threshold learned on the development data, then
the pair is labeled as positive entailment.
3 Knowledge Representation
For the textual entailment task, our logic prover
uses a two-layered logical representation which
captures the syntactic and semantic propositions
encoded in a text fragment.
3.1 Logic Form Transformation
In the first stage of our representation pro-
cess, COGEX converts a1 and a0 into logic
forms (Moldovan and Rus, 2001). More specifi-
cally, a predicate is created for each noun, verb,
adjective and adverb. The nouns that form a noun
compound are gathered under a nn NNC predi-
cate. Each named entity class of a noun has a
corresponding predicate which shares its argument
with the noun predicate it modifies. Predicates for
820
(a0a2a1 ,a3a5a4 ) (a3a6a1 ,a0a7a4 )
All people read a0 Some smart people read Some people read a2a0 All smart people read
All smart people read a0 Some people read Some smart people read a2a0 All people read
Add the dropped points for a0 ’s modifiers Subtract points for modifiers not present in a0
Table 1: The quantification of a1 and a0 influences the proof scoring algorithm
prepositions and conjunctions are also added to
link the text’s constituents. This syntactic layer of
the logic representation is, automatically, derived
from a full parse tree and acknowledges syntax-
based relationships such as: syntactic subjects,
syntactic objects, prepositional attachments, com-
plex nominals, and adjectival/adverbial adjuncts.
In order to objectively evaluate our represen-
tation, we derived it from two different sources:
constituency parse trees (generated with our
implementation of (Collins, 1997)) and depen-
dency parse trees (created using Minipar (Lin,
1998))1. The two logic forms are slightly dif-
ferent. The dependency representation captures
more accurately the syntactic dependencies
between the concepts, but lacks the semantic
information that our semantic parser extracts from
the constituency parse trees. For instance, the
sentence Gilda Flores was kidnapped on the 13th
of January 19902 is “constituency” represented
as Gilda NN(x1) & Flores NN(x2) &
nn NNC(x3,x1,x2) & human NE(x3) &
kidnap VB(e1,x9,x3) & on IN(e1,x8)
& 13th NN(x4) & of NN(x5) &
January (x6) & 1990 NN(x7)
& nn NNC(x8,x4,x5,x6,x7) &
date NE(x8) and its “dependency”
logic form is Gilda Flores NN(x2)
& human NE(x2) &
kidnap VB(e1,x4,x2) & on IN(e1,x3)
& 13th NN(x3) & of IN(x3,x1) &
January 1990 NN(x1).
3.1.1 Negation
The exceptions to the one-predicate-per-
open-class-word rule include the adverbs not
and never. In cases similar to further de-
tails were not released, the system removes
1The experimental results described in this paper were
performed using two systems: the logic prover when
it receives as input the constituency logic representation
(COGEXa8 ) and the dependency representation (COGEXa9 ).
2All examples shown in this paper are from the entail-
ment corpus released as part of the Second RTE challenge
(www.pascal-network.org/Challenges/RTE2).
The RTE datasets will be described in Section 7.
not RB(x3,e1) and negates the verb’s
predicate (-release VB(e1,x1,x2)).
Similarly, for nouns whose determiner is no,
for example, No case of indigenously ac-
quired rabies infection has been confirmed, the
verb’s predicate is negated (case NN(x1) &
-confirm VB(e2,x15,x1)).
3.2 Semantic Relations
The second layer of our logic representation adds
the semantic relations, the underlying relation-
ships between concepts. They provide the se-
mantic background for the text, which allows for
a denser connectivity between the concepts ex-
pressed in text. Our semantic parser takes free En-
glish text or parsed sentences and extracts a rich
set of semantic relations3 between words or con-
cepts in each sentence. It focuses not only on
the verb and its arguments, but also on seman-
tic relations encoded in syntactic patterns such as
complex nominals, genitives, adjectival phrases,
and adjectival clauses. Our representation mod-
ule maps each semantic relation identified by the
parser to a predicate whose arguments are the
events and entities that participate in the rela-
tion and it adds these semantic predicates to the
logic form. For example, the previous logic form
is augmented with the THEME SR(x3,e1) &
TIME SR(x8,e1) relations4 (Gilda Flores is
the theme of the kidnap event and 13th of January
1990 shows the time of the kidnapping).
3.3 Temporal Representation
In addition to the semantic predicates, we
represent every date/time into a normal-
ized form time TMP(BeginFn(event),
year, month, date, hour, minute,
second) & time TMP(EndFn(event),
year, month, date, hour, minute,
second). Furthermore, temporal reasoning
3We consider relations such as AGENT,
THEME, TIME, LOCATION, MANNER, CAUSE,
INSTRUMENT, POSSESSION, PURPOSE,
MEASURE, KINSHIP, ATTRIBUTE, etc.
4R(x,y) should be read as “x is R of y”.
821
predicates are derived from both the detected
semantic relations as well as from a module
which utilizes a learning algorithm to detect
temporally ordered events (a3a1a0 a4a3a2a5a4 a4a3a2a7a6 a6 , where
a0 is the temporal signal linking two events
a2 a4 and a2 a6 ) (Moldovan et al., 2005). From
each triple, temporally related SUMO predicates
are generated based on hand-coded rules for
the signal classes (a3a1a0 sequence, a2a5a4 a4a3a2a7a6 a6 a8
earlier TMP(e1,e2), a3a1a0 contain, a2 a4 a4a3a2 a6 a6a9a8
during TMP(e1,e2), etc.). In the above
example, 13th of January 1990 is normalized
to the interval time TMP(BeginFn(e2),
1990, 1, 13, 0, 0, 0) &
time TMP(EndFn(e2), 1990, 1, 13,
23, 59, 59) and during TMP(e1,e2) is
added to the logical representation to show when
the kidnapping occurred.
4 Axioms on Demand
COGEX’s usable list consists of all the axioms
generated either automatically or by hand. The
system generates axioms on demand for a given
a3 a1a5a4
a0a7a6 pair whenever the semantic connectivity
between two concepts needs to be established in
a proof. The axioms on demand are lexical chains
and world knowledge axioms. We are keen on the
idea of axioms on demand since it is not possible
to derive apriori all axioms needed in an arbitrary
proof. This brings a considerable level of robust-
ness to our entailment system.
4.1 eXtended WordNet lexical chains
For the semantic entailment task, the ability to
recognize two semantically-related words is an
important requirement. Therefore, we automat-
ically construct lexical chains of WordNet rela-
tions from a1 ’s constituents to a0 ’s (Moldovan and
Novischi, 2002). In order to avoid errors intro-
duced by a Word Sense Disambiguation system,
we used the first a10 senses for each word5 un-
less the source and the target of the chain are
synonyms. If a chain exists6, the system gener-
ates, on demand, an axiom with the predicates
of the source (from a1 ) and the target (from a0 ).
5Because WordNet senses are ranked based on their fre-
quency, the correct sense is most likely among the first a11 . In
our experiments, a11a13a12a15a14 .
6Each lexical chain is assigned a weight based on its prop-
erties: shorter chains are better than longer ones, the relations
are not equally important and their order in the chain influ-
ences its strength. If the weight of a chain is above a given
threshold, the lexical chain is discarded.
For example, given the ISA relation between mur-
der#1 and kill#1, the system generates, when
needed, the axiom murder VB(e1,x1,x2)
a16 kill VB(e1,x1,x2). The remaining of
this section details some of the requirements for
creating accurate lexical chains.
Because our extended version of Word-
Net has attached named entities to each noun
synset, the lexical chain axioms append the
entity name of the target concept, whenever
it exists. For example, the logic prover uses
the axiom Nicaraguan JJ(x1,x2) a16
Nicaragua NN(x1) & country NE(x1)
when it tries to infer electoral campaign is held in
Nicaragua from Nicaraguan electoral campaign.
We ensured the relevance of the lexical chains
by limiting the path length to three relations and
the set of WordNet relations used to create the
chains by discarding the paths that contain certain
relations in a particular order. For example, the
automatic axiom generation module does not con-
sider chains with an IS-A relation followed by a
HYPONYMY link (a17a19a18a21a20a23a22a25a24a27a26a29a28
a30a32a31a3a33a35a34
a36
a16
a22a37a20a1a38a40a39
a41a43a42a40a44a43a45a47a46a48a42a50a49a51a42
a36
a16
a52a54a53
a38a47a55a56a28a57a20a23a38 ). Similarly, the system rejected chains
with more than one HYPONYMY relations. Al-
though these relations link semantically related
concepts, the type of semantic similarity they in-
troduce is not suited for inferences. Another re-
striction imposed on the lexical chains generated
for entailment is not to start from or include too
general concepts7. Therefore, we assigned to each
noun and verb synset from WordNet a generality
weight based on its relative position within its hi-
erarchy and on its frequency in a large corpus. If
a58
a30 is the depth of concept
a22
a30 ,
a52
a4a60a59 is the max-
imum depth in a22 a30 ’s hierarchy a0 a30 and a61a62a17 a3a63a22 a30 a6a65a64
a36a67a66
a28a57a26 a3a69a68 a3a63a22
a30
a6 a6 is the information content of
a22
a30 mea-
sured on the British National Corpus, then
a26
a53a71a70a60a53
a55a56a24
a66
a20a1a38a40a39a73a72 a3a63a22
a30
a6a74a64 a75
a76
a59a78a77 a4
a79a81a80
a59a83a82
a61a84a17 a3a63a22
a30
a6a29a85
In our experiments, we discarded the chains with
concepts whose generality weight exceeded 0.8
such as object NN#1, act VB#1, be VB#1, etc.
Another important change that we intro-
duced in our extension of WordNet is the re-
finement of the DERIVATION relation which
links verbs with their corresponding nominal-
ized nouns. Because the relation is ambigu-
ous regarding the role of the noun, we split
7There are no restrictions on the target concept.
822
this relation in three: ACT-DERIVATION, AGENT-
DERIVATION and THEME-DERIVATION. The
role of the nominalization determines the ar-
gument given to the noun predicate. For in-
stance, the axioms act VB(e1,x1,x2) a16
acting NN(e1)(ACT), act VB(e1,x1,x2)
a16 actor NN(x1) (AGENT) reflect different
types of derivation.
4.2 NLP Axioms
Our NLP axioms are linguistic rewriting rules that
help break down complex logic structures and
express syntactic equivalence. After analyzing
the logic form and the parse trees of each text
fragment, the system, automatically, generates
axioms to break down complex nominals and
coordinating conjunctions into their constituents
so that other axioms can be applied, individually,
to the components. These axioms are made avail-
able only to the a3 a1a5a4 a0a7a6 pair that generated them.
For example, the axiom nn NNC(x3,x1,x2)
& francisco NN(x1) & merino NN(x2)
a16 merino NN(x3) breaks down the noun
compound Francisco Merino into Francisco and
Merino and helps COGEX infer Merino’s home
from Francisco Merino’s home.
4.3 World Knowledge Axioms
Because, sometimes, the lexical or the syntactic
knowledge cannot solve an entailment pair, we
exploit the WordNet glosses, an abundant source
of world knowledge. We used the logic forms
of the glosses provided by eXtended WordNet8
to, automatically, create our world knowledge
axioms. For example, the first sense of noun Pope
and its definition the head of the Roman Catholic
Church introduces the axiom Pope NN(x1)
a0 head NN(x1) & of IN(x1,x2) &
Roman Catholic Church NN(x2) which is
used by prover to show the entailment between
a1 : A place of sorrow, after Pope John Paul II
died, became a place of celebration, as Roman
Catholic faithful gathered in downtown Chicago
to mark the installation of new Pope Benedict
XVI. and a0 : Pope Benedict XVI is the new leader
of the Roman Catholic Church.
We also incorporate in our system a small
common-sense knowledge base of 383 hand-
coded world knowledge axioms, where 153 have
been manually designed based on the entire de-
8http://xwn.hlt.utdallas.edu
velopment set data, and 230 originate from pre-
vious projects. These axioms express knowledge
that could not be derived from WordNet regarding
employment9, family relations, awards, etc.
5 Semantic Calculus
The Semantic Calculus axioms combine two se-
mantic relations identified within a text fragment
and increase the semantic connectivity of the
text (Tatu and Moldovan, 2005). A semantic ax-
iom which combines two relations, a1 a30 and a1a3a2 , is
devised by observing the semantic connection be-
tween the a4 a4 and a4a6a5 words for which there exists
at least one other word, a4 a6 , such that a1 a30 a3a7a4 a4 a4a8a4 a6 a6
(a4 a4
a9
a59
a16
a4 a6 ) and
a1a10a2
a3a7a4 a6 a4a8a4a10a5
a6 (
a4 a6
a9a12a11
a16
a4a10a5 ) hold true.
We note that not any two semantic relations can
be combined: a1 a30 and a1a3a2 have to be compatible
with respect to the part-of-speech of the common
argument. Depending on their properties, there
are up to 8 combinations between any two se-
mantic relations and their inverses, not counting
the combinations between a semantic relation and
itself10. Many combinations are not semantically
significant, for example, KINSHIP SR(x1,x2)
& TEMPORAL SR(x2,e1) is unlikely to be
found in text. Trying to solve the semantic
combinations one comes upon in text corpora,
we analyzed the RTE development corpora and
devised rules for some of the a1 a30a14a13 a1a3a2 combina-
tions encountered. We validated these axioms
by checking all the a3a7a4a83a4 a4a8a4 a5 a6 pairs from the LA
Times text collection such that a3
a1
a30a15a13
a1a16a2 a6
a3a7a4 a4 a4a8a4a10a5
a6
holds. We have identified 82 semantic axioms
that show how semantic relations can be com-
bined. These axioms enable inference of unstated
meaning from the semantics detected in text.
For example, if a1 states explicitly the KINSHIP
(KIN) relations between Nicholas Cage and
Alice Kim Cage and between Alice Kim Cage
and Kal-el Coppola Cage, the logic prover uses
the KIN SR(x1,x2) & KIN SR(x2,x3)
a16 KIN SR(x1,x3) semantic axiom (the
transitivity of the blood relation) and the sym-
metry of this relationship (KIN SR(x1,x2)
9For example, the axiom country NE(x1) &
negotiator NN(x2) & nn NNC(x3,x1,x2) a17
work VB(e1,x2,x4) & for IN(e1,x1) helps the
prover infer that Christopher Hill works for the US from top
US negotiator, Christopher Hill.
10Harabagiu and Moldovan (1998) lists the exact number
of possible combinations for several WordNet relations and
part-of-speech classes.
823
a16 KIN SR(x2,x1)) to infer
a0 ’s statement
(KIN(Kal-el Coppola Cage, Nicholas Cage)). An-
other frequent axiom is LOCATION SR(x1,x2)
& PARTWHOLE SR(x2,x3) a16
LOCATION SR(x1,x3). Given the text
John lives in Dallas, Texas and using the axiom,
the system infers that John lives in Texas. The
system applies the 82 axioms independent of
the concepts involved in the semantic compo-
sition. There are rules that can be applied only
if the concepts that participate satisfy a certain
condition or if the relations are of a certain
type. For example, LOCATION SR(x1,x2)
& LOCATION SR(x2,x3) a16
LOCATION SR(x1,x3) only if the LOCATION
relation shows inclusion (John is in the car in the
garage a16 LOCATION SR(John,garage).
John is near the car behind the garage a2a16
LOCATION SR(John,garage)).
6 Temporal Axioms
One of the types of temporal axioms that we load
in our logic prover links specific dates to more
general time intervals. For example, October 2000
entails the year 2000. These axioms are automati-
cally generated before the search for a proof starts.
Additionally, the prover uses a SUMO knowledge
base of temporal reasoning axioms that consists
of axioms for a representation of time points and
time intervals, Allen (Allen, 1991) primitives, and
temporal functions. For example, during is a tran-
sitive Allen primitive: during TMP(e1,e2)
& during TMP(e2,e3) a16
during TMP(e1,e3).
7 Experiments and Results
The benchmark corpus for the RTE 2005 task con-
sists of seven subsets with a 50%-50% split be-
tween the positive entailment examples and the
negative ones. Each subgroup corresponds to a
different NLP application: Information Retrival
(IR), Comparable Documents (CD), Reading Com-
prehension (RC), Question Answering (QA), Infor-
mation Extraction (IE), Machine Translation (MT),
and Paraphrase Acquisition (PP). The RTE data
set includes 1367 English a3 a1a5a4 a0a7a6 pairs from the
news domain (political, economical, etc.). The
RTE 2006 data covered only four NLP tasks (IE, IR,
QA and Multi-document Summarization (SUM))
with an identical split between positive and nega-
tive examples. Table 2 presents the data statistics.
Development set Test set
RTE 2005 567 800
RTE 2006 800 800
Table 2: Datasets Statistics
7.1 COGEX’s Results
Tables 3 and 4 summarize COGEX’s performance
on the RTE datasets, when it received as input the
different-source logic forms11.
On the RTE 2005 data, the overall performance
on the test set is similar for both logic proving
runs, COGEX a0 and COGEXa79 . On the development
set, the semantically enhanced logic forms helped
the prover distinguish better the positive entail-
ments (COGEX a0 has an overall higher precision
than COGEXa79 ). If we analyze the performance on
the test data, then COGEX a0 performs slightly bet-
ter on MT, CD and PP and worse on the RC, IR and
QA tasks. The major differences between the two
logic forms are the semantic content (incomplete
for the dependency-derived logic forms) and, be-
cause the text’s tokenization is different, the num-
ber of predicates in a0 ’s logic forms is different
which leads to completely different proof scores.
On the RTE 2006 test data, the system which
uses the dependency logic forms outperforms
COGEX a0 . COGEXa79 performs better on almost all
tasks (except SUM) and brings a significant im-
provement over COGEX a0 on the IR task. Some
of the positive examples that the systems did not
label correctly require world knowledge that we
do not have encoded in our axiom set. One ex-
ample for which both systems returned the wrong
answer is pair 353 (test 2006) where, from China’s
decade-long practice of keeping its currency val-
ued at around 8.28 yuan to the dollar, the system
should recognize the relation between the yuan
and China’s currency and infer that the currency
used in China is the yuan because a country’s cur-
rency a0 currency used in the country. Some of
the pairs that the prover, currently, cannot handle
involve numeric calculus and human-oriented es-
timations. Consider, for example, pair 359 (dev
set, RTE 2006) labeled as positive, for which the
logic prover could not determine that 15 safety vi-
olations a0 numerous safety violations.
The deeper analysis of the systems’ output
11For the RTE 2005 data, we list the confidence-weighted
score (cws) (Dagan et al., 2005) and, for the RTE 2006 data,
the average precision (ap) measure (Bar-Haim et al., 2006).
824
Task COGEXa8 COGEXa9 LEXALIGN COMBINATION
acc cws f acc cws f acc cws f acc cws f
IE 58.33 60.90 60.31 57.50 57.03 51.42 56.66 53.41 59.99 62.50 67.63 57.14
IR 52.22 62.41 15.68 53.33 59.67 27.58 50.00 55.92 0.00 68.88 75.77 64.10
CD 82.00 88.90 79.69 79.33 87.15 74.38 82.00 88.04 80.57 84.66 91.73 82.70
QA 50.00 56.27 0.00 51.53 42.37 64.80 53.07 43.76 63.90 60.76 55.05 63.82
RC 53.57 56.38 38.09 57.14 59.32 58.33 57.85 60.26 49.57 60.00 62.89 50.00
MT 55.83 55.83 53.91 52.50 58.17 27.84 51.66 45.94 67.04 64.16 63.80 66.66
PP 56.00 63.11 26.66 54.00 58.15 30.30 50.00 47.03 0.00 68.00 75.27 63.63
TEST 59.37 63.09 48.00 59.12 57.17 54.52 59.12 55.74 59.17 67.25 67.64 64.69
DEV 63.66 63.44 64.48 61.19 63.63 57.52 62.08 59.94 60.83 70.37 71.89 66.66
Table 3: RTE 2005 data results (accuracy, confidence-weighted score, and f-measure for the true class)
Task COGEXa8 COGEXa9 LEXALIGN COMBINATION
acc ap f acc ap f acc ap f acc ap f
IE 58.00 49.71 57.57 59.00 59.74 63.71 54.00 49.70 67.14 71.50 62.99 71.36
IR 62.50 65.91 56.14 73.50 72.50 73.89 64.50 69.45 65.02 74.00 74.30 72.92
QA 62.00 67.30 48.64 64.00 68.16 57.64 58.50 55.78 57.86 70.50 75.10 66.67
SUM 74.50 77.60 74.62 74.00 79.68 73.73 70.50 76.82 73.05 79.00 80.33 78.13
TEST 64.25 66.31 60.16 67.62 70.69 67.50 61.87 57.64 66.07 73.75 71.33 72.37
DEV 64.50 64.05 66.19 69.00 70.92 69.31 62.25 62.66 62.72 75.12 76.28 76.83
Table 4: RTE 2006 data results (accuracy, average precision, and f-measure for the true class)
showed that while WordNet lexical chains and
NLP axioms are the most frequently used axioms
throughout the proofs, the semantic and tempo-
ral axioms bring the highest improvement in ac-
curacy, for the RTE data.
7.2 Lexical Alignment
Inspired by the positive examples whose a0 is in
a high degree lexically subsumed by a1 , we de-
veloped a shallow system which measures their
overlap by computing an edit distance between the
text and the hypothesis. The cost of deleting a
word from a1 a3a7a4 a1 a16
a82
a6 is equal to 0, the cost
of replacing a word from a1 with another from a0
a3a7a4 a1
a16
a4 a4 , where a4 a1 a2
a64
a4 a4 and a4 a1 and a4 a4 are
not synonyms in WordNeta6 equal to a0 (we do not
allow replace operations) and the cost of inserting
a word from a0 a3
a82
a16
a4 a4
a6 varies with the part-
of-speech of the inserted word (higher values for
WordNet nouns, adjectives or adverbs, lower for
verbs and a minimum value for everything else).
Table 5 shows a minimum cost alignment.
The performance of this lexical method (LEX-
ALIGN) is shown in Tables 3 and 4. The align-
ment technique performs significantly better on
the a3 a1 a4 a0 a6 pairs in the CD (RTE 2005) and SUM
(RTE 2006) tasks. For these tasks, all three sys-
tems performed the best because the text of false
pairs is not entailing the hypothesis even at the lex-
ical level. For pair 682 (test set, RTE 2006), a1
and a0 have very few words overlapping and there
are no axioms that can be used to derive knowl-
edge that supports the hypothesis. Contrarily, for
the IE task, the systems were fooled by the high
word overlap between a1 and a0 . For example, pair
678’s text (test set, RTE 2006) contains the entire
hypothesis in its if clause. For this task, we had the
highest number of false positives, around double
when compared to the other applications. LEX-
ALIGN works surprisingly well on the RTE data. It
outperforms the semantic systems on the 2005 QA
test data, but it has its limitations. The logic rep-
resentations are generated from parse trees which
are not always accurate (a1 86% accuracy). Once
syntactic and semantic parsers are perfected, the
logical semantic approach shall prove its potential.
7.3 Merging three systems
Because the two logical representations and the
lexical method are very different and perform
better on different sets of tasks, we combined
the scores returned by each system12 to see if a
mixed approach performs better than each individ-
ual method. For each NLP task, we built a classi-
fier based on the linear combination of the three
scores. Each task’s classifier labels pair a20 as pos-
itive if a2a4a3 a45a6a5a8a7a10a9
a8a12a11
a22a37a28a48a55
a53
a0 a3 a20
a6a14a13 a2a15a3
a45a6a5a8a7a10a9
a9a16a11
a22a25a28a48a55
a53
a79
a3 a20
a6a14a13
12Each system returns a score between 0 and 1, a number
close to 0 indicating a probable negative example and a num-
ber close to 1 indicating a probable positive example. Each
a17a19a18a21a20a23a22a25a24 pair’s lexical alignment score,
a26a28a27a10a29a31a30a31a32a31a33a35a34a37a36a39a38a41a40
a59a43a42a10a44 , is the
normalized average edit distance cost.
825
a18 : The Council of Europe has * 45 member states. Three countries from ...
DEL INS DEL
a22 : The Council of Europe * is made up by 45 member states. *
Table 5: The lexical alignment for RTE 2006 pair 615 (test set)
a2a1a0
a7a10a9a3a2a5a4 a30 a5a3a46
a11
a22a37a28a48a55
a53 a0
a7a10a9a3a2a5a4 a30 a5a50a46
a3 a20
a6a7a6a9a8
a85a11a10
, where the op-
timum values of the classifier’s real-valued pa-
rameters (a2 a3 a45a6a5a8a7a10a9
a8
a4
a2a15a3
a45a6a5a8a7a10a9
a9
a4
a2a1a0
a7a10a9a3a2a5a4 a30 a5a3a46 ) were deter-
mined using a grid search on each development
set. Given the different nature of each application,
the a2 parameters vary with each task. For exam-
ple, the final score given to each IE 2006 pair is
highly dependent on the score given by COGEX
when it received as input the logic forms created
from the constituency parse trees with a small cor-
rection from the dependency parse trees logic form
system13. For the IE task, the lexical alignment
performs the worst among the three systems. On
the other hand, for the IR task, the score given by
LEXALIGN is taken into account14. Tables 3 and
4 summarize the performance of the three system
combination. This hybrid approach performs bet-
ter than all other systems for all measures on all
tasks. It displays the same behavior as its depen-
dents: high accuracy on the CD and SUM tasks and
many false positives for the IE task.
8 Conclusion
In this paper, we present a logic form represen-
tation of knowledge which captures syntactic de-
pendencies as well as semantic relations between
concepts and includes special temporal predicates.
We implemented several changes to our Word-
Net lexical chains module which lead to fewer un-
sound axioms and incorporated in our logic prover
semantic and temporal axioms which decrease its
dependence on world knowledge. We plan to im-
prove our logic prover to detect false entailments
even when the two texts have a high word overlap
and expand our axiom set.
References
J. Allen. 1991. Time and Time Again: The Many Ways
to Represent Time. Internatinal Journal of Intelli-
gent Systems, 4(6):341–355.
R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, D. Gi-
ampiccolo, B. Magnini, and I. Szpektor. 2006. The
Second PASCAL Recognising Textual Entailment
13a12a14a13a16a15
a42
a34 a36a18a17 a12a20a19a22a21a11a19
a20
a12a14a13a23a15
a42
a34 a36a18a24 a12a26a25a3a21 a14
a20
a12
a33a35a34a37a36a39a38a41a40
a59a43a42a10a44
a12a20a27a28a25a29a21 a30
14a12 a13a16a15
a42
a34 a36a18a17 a12a31a25a3a21 a14
a20
a12 a13a23a15
a42
a34 a36a18a24 a12a26a25a3a21a32a19
a20
a12
a33a35a34a37a36a39a38a41a40
a59a43a42a10a44
a12a31a25a29a21 a30
Challenge. In Proceedings of the Second PASCAL
Challenges Workshop.
J. Bos and K. Markert. 2005. Recognizing Textual
Entailment with Logical Inference. In Proceedings
of HLT/EMNLP 2005, Vancouver, Canada, October.
M. Collins. 1997. Three Generative, Lexicalized Mod-
els for Statistical Parsing. In Proceedings of the
ACL-97.
I. Dagan, O. Glickman, and B. Magnini. 2005. The
PASCAL Recognising Textual Entailment Chal-
lenge. In Proceedings of the PASCAL Challenges
Workshop, Southampton, U.K., April.
R. de Salvo Braz, R. Girju, V. Punyakanok, D. Roth,
and M. Sammons. 2005. An Inference Model for
Semantic Entailment in Natural Language. In Pro-
ceedings of AAAI-2005.
S. Harabagiu and D. Moldovan. 1998. Knowledge
Processing on Extended WordNet. In Christiane
Fellbaum, editor, WordNet: an Electronic Lexical
Database and Some of its Applications, pages 379–
405. MIT Press.
H. Kamp and U. Reyle. 1993. From Discourse to
Logic: Introduction to Model-theoretic Semantics
of Natural Language, Formal Logic and Discourse
Representation Theory. Kluwer Academic Publish-
ers.
D. Lin. 1998. Dependency-based Evaluation of MINI-
PAR. In Workshop on the Evaluation of Parsing Sys-
tems, Granada, Spain, May.
William W. McCune, 1994. OTTER 3.0 Reference
Manual and Guide.
D. Moldovan and A. Novischi. 2002. Lexical chains
for Question Answering. In Proceedings of COL-
ING, Taipei, Taiwan, August.
D. Moldovan and V. Rus. 2001. Logic Form Transfor-
mation of WordNet and its Applicability to Question
Answering. In Proceedings of ACL, France.
D. Moldovan, C. Clark, S. Harabagiu, and S. Maio-
rano. 2003. COGEX A Logic Prover for Question
Answering. In Proceedings of the HLT/NAACL.
D. Moldovan, C. Clark, and S. Harabagiu. 2005. Tem-
poral Context Representation and Reasoning. In
Proceedings of IJCAI, Edinburgh, Scotland.
M. Tatu and D. Moldovan. 2005. A Semantic Ap-
proach to Recognizing Textual Entailment. In Pro-
ceedings of HLT/EMNLP.
826
