Entailment, Intensionality and Text Understanding
Cleo Condoravdi, Dick Crouch, Valeria de Paiva, Reinhard Stolle, Daniel G. Bobrow
PARC
3333 Coyote Hill Road
Palo Alto, CA, USA, 94304
rdc+@parc.com
Abstract
We argue that the detection of entailment and
contradiction relations between texts is a min-
imal metric for the evaluation of text under-
standing systems. Intensionality, which is
widespread in natural language, raises a number
of detection issues that cannot be brushed aside.
We describe a contexted clausal representation,
derived from approaches in formal semantics,
that permits an extended range of intensional
entailments and contradictions to be tractably
detected.
1 Introduction
What are the appropriate metrics for evaluating perfor-
mance in text understanding? There is probably no one
universal measure that suffices, leading to a collection
of metrics for evaluating different facets of text under-
standing. This paper makes the case for the inclusion of
one particular evaluation metric in this collection: namely
the detection of entailment and contradiction relations be-
tween texts / portions of texts.
Relations of entailment and contradiction are the key
data of semantics, traditionally viewed as a branch of
linguistics. The ability to recognize such semantic rela-
tions is clearly not a sufficient criterion for language un-
derstanding: there is more to language understanding than
just being able to tell that one sentence follows from an-
other. But we would argue that it is a minimal, necessary
criterion. If you understand sentences (1) and (2), then
you can recognize that they are contradictory.
(1) No civilians were killed in the Najaf suicide bomb-
ing.
(2) Two civilians died in the Najaf suicide bombing.
Conversely, if you fail to recognize the contradiction, then
you cannot have understood (1) and (2).
In proposing an evaluation metric, the onus is on the
proposer to do a number of things. First, to show that the
metric measures something real and useful: in this case,
that entailment and contradiction detection (ECD) mea-
sures an important facet of language understanding, and
that it correlates with the ability to develop useful appli-
cations (section 2). Second, to indicate the range of tech-
nical challenges that the metric raises: section 3 empha-
sizes one of these — the need to deal with intensional en-
tailments, and the wisdom of drawing on the large body
of relevant work in formal semantics in attempting to do
so. Third, to show that the metric is not impossibly diffi-
cult for current technologies to satisfy, so that it encour-
ages technological progress rather than stunting it: sec-
tion 4 discusses a prototype system (described more fully
in (Crouch et al., 2002)) to argue that, with current tech-
nology, ECD is a realistic though challenging metric.
2 Entailment and Contradiction Metrics
2.1 Theoretical Justification
The ability to recognize entailment and contradiction re-
lations is a consequence of language understanding, as
examples (1)–(2) show. But before concluding that en-
tailment and contradiction detection is a suitable evalua-
tion metric for text understanding, two cautionary points
should be addressed. First, it cannot be a sufficient met-
ric, since there is more to understanding than entailment
and contradiction, and we should ask what aspects of un-
derstanding it does not evaluate. Second, we need to be
reasonably sure that it is a necessary metric, and does
not measure some merely accidental manifestation of un-
derstanding. To give an analogy, clearing up spots is a
consequence of curing infections like measles; but clear-
ing spots is a poor metric, especially if success can be
achieved by bleaching spots off the skin or covering them
with make-up. A measles-cure metric should address the
presence of the infection, and not just its symptoms.
In terms of (in)sufficiency, we should note that understanding a text implies two abilities. (i) You can relate the
text to the world, and know what the world would have
to be like if the text were true or if you followed instruc-
tions contained in it.1 (ii) You can relate the text to other
texts, and can tell where texts agree or disagree in what
they say. Clearly, entailment and contradiction detection
directly measures only the second ability.
In terms of necessity, there are two points to be made.
The first is simply an appeal to intuition. Given a pre-
theoretical grasp of what language understanding is, the
ability to draw inferences and detect entailments and con-
tradictions just does seem to be part of understanding, and
not merely an accidental symptom of it. The second point
is more technical. Suppose we assume the standard ma-
chinery of modern logic, linking proof theory and model
theory. Then a proof-theoretic ability to detect entail-
ments and contradictions between expressions is intrin-
sically linked to a model-theoretic ability to relate those
expressions to (abstract) models of the world. In other
words, the abilities to relate texts to texts and texts to
the world are connected, and there are at least some ap-
proaches that show how success in the former feeds into
success in the latter.
The reference to logic and in particular to model the-
ory is deliberate. It provides an arsenal of tools for deal-
ing with entailment and contradiction, and there is also
a large body of work in formal semantics linking natural
language to these tools. One should at least consider mak-
ing use of these resources. However, it is important not
to characterize entailment and contradiction so narrowly
as to preclude other methods. There needs to be room for
probabilistic / Bayesian notions of inference, e.g. (Pearl,
1991), as well as for attempts to use corpus-based methods
to detect entailment / subsumption, e.g. the use of TF-IDF
by (Monz and de Rijke, 2001). That is, one can agree
on the importance of entailment and contradiction detection as an evaluation metric, while disagreeing on the best
methods for achieving success.
2.2 Practical Justification
Even if we grant that entailment and contradiction detec-
tion (ECD) measures a core aspect of language under-
standing, it does not follow that it measures a useful as-
pect of understanding. However, we can point to at least
two application areas that directly demonstrate the utility
of the metric.
The first is an application that we are actually work-
1Knowing what the world would be like if the text were true
is not the same as being able to tell if the text is true. I know how
things would have to be for it to be true that “There is no greatest
pair of prime numbers, p1 and p2, such that p2 − p1 = 2.” But
I have no idea how to tell whether this is true or not.
ing on, concerning quality maintenance for document col-
lections. The Eureka system includes a large textual
database containing engineer-authored documents (tips)
about the repair and maintenance of printers and photo-
copiers. Over time, duplicate and inconsistent material
builds up, undermining the utility of the database to field
engineers. Human validators who maintain the quality
of the document collection would benefit from ECD text
analysis tools that locate points of contradiction and en-
tailment between different but related tips in the database.
A second application building fairly directly on ECD
would be yes-no question answering. Positive or negative
answers to yes-no questions can be characterized as those
that (respectively) entail or contradict a declarative form
of the query. Yes-no question answering would be useful
for autonomous systems that attempt to interpret and act
on information acquired from textual sources, rather than
merely pre-filtering it for human interpretation and action.
Despite its relevance to applications like the above, one
of the advantages of ECD is a degree of task neutrality.
Entailment and contradiction relations can be character-
ized independently of the use, if any, to which they are
put. Many other reasonable metrics for language under-
standing are not so task neutral. For example, in a dia-
logue system one measure of understanding would be suc-
cess in taking a (task) appropriate action or making an ap-
propriate response. However, it can be non-trivial to de-
termine how much of this success is due to language un-
derstanding and how much due to prior understanding of
the task: a good, highly constraining task model can over-
come many deficiencies in language processing.
Task neutrality is not the same as domain or genre neu-
trality. ECD can depend on domain knowledge. For ex-
ample, if I do not know that belladonna and deadly night-
shade name the same plant, I will not recognize that an
instruction to uproot belladonna entails an instruction to
uproot deadly nightshade. But this is arguably a failure of
botanical knowledge, not a lapse in language understand-
ing. We will return to the issue of domain dependence
later. However, there are many instances where ECD does
not depend on domain knowledge, e.g. (1)–(2) or (3)–(4)
(taken, with simplifications, from the Eureka corpus).
(3) Corrosion caused intermittent electrical contact.
(4) Corrosion prevented continuous electrical contact.
One does not need to be an electrician to recognize the po-
tential equivalence of (3) and (4); merely that intermittent
means non-continuous, so that causing something to be
intermittent can be the same as preventing it from being
continuous. And even in cases where domain knowledge
is required, ECD is still also reliant on linguistic knowl-
edge of this kind.
The success of methods for ECD may also depend on
genre. For newswire stories (Monz and de Rijke, 2001)
reports that TF-IDF performs well in detecting subsump-
tion (i.e. entailment) between texts. This may be a con-
sequence of the way that newswires convey generally
consistent information about particular individuals and
events: reference to the same entities is highly correlated
with subsumption in such a genre. The use of PLSA on
the Eureka corpus (Brants and Stolle, 2002) was less suc-
cessful: the corpus has less reference to concrete events
and individuals, and contains conflicting diagnoses and
recommendations for repair actions.
3 Intensionality
The detection of entailments and contradictions between
pieces of text raises a number of technical challenges, in-
cluding but not limited to the following. (a) Ambigu-
ity is ubiquitous in natural language, and poses an espe-
cial problem for text processing, where longer sentences
tend to increase grammatical ambiguity, and where it is
not generally possible to enter into clarificatory dialogues
with the text author. Ambiguity impacts ECD because se-
mantic relations may hold under some interpretations but
not under others. (b) Reference resolution in the broad
sense of determining that two texts talk about the same
things, rather than the narrower sense of intra-text pro-
noun resolution, is also crucial to ECD. Entailment and
contradiction relations presuppose shared subject matter,
and reference resolution plays a role in establishing this.
(c) World/domain knowledge, as we noted before, can be
involved in establishing entailment and contradiction re-
lations. (d) Representations that enable ECD must be de-
rived from texts. What should these representations be
like, and how should they be derived? At a bare minimum
some level of parsing to obtain predicate-argument struc-
tures seems necessary, but how much more than this is re-
quired?
We cannot address all of these issues in this paper, and
so will focus on the last one. In particular, we want to
point out that intensional constructions are commonplace
in text, and that simple first-order predicate-argument
structures are inadequate for detecting intensional entail-
ments and contradictions. Within the formal semantics
literature since at least Montague, the phenomena raised
by intensionality are well known and extensively studied,
though not always satisfactorily dealt with. Yet this has
been poorly reflected in computational work relating lan-
guage understanding and knowledge representation. For-
mal semanticists have the luxury of not having to per-
form automated inference on their semantic representa-
tions, and can trade tractability for expressiveness. Com-
putational applications on the other hand have traded ex-
pressiveness for tractability, either by trying to shoe-horn
everything into an ill-fitting first-order representation, or
by coding up special purpose and not easily generaliz-
able methods for dealing with particular intensional phe-
nomena in special tasks and domains. None of these ap-
proaches are particularly satisfactory for the task of de-
tecting substantial numbers of entailment and contradic-
tion relations between texts. A more balanced trade-off is
required, and we suggest at least one way in which ma-
chinery from formal semantics can be adapted to support
this.
3.1 Intensionality is pervasive
Intensionality extends beyond the conventional examples
of propositional attitudes (beliefs, desires etc) and formal
semanticists seeking unicorns. Any predication that has
a proposition-, fact- or property-denoting argument introduces intensionality. Almost every lexical item that takes
a clausal or predicative argument should be seen as inten-
sional. As an anecdotal test of how common this is, in-
spection of 100 Eureka tips about the workaday world of
printer and copier repair showed that 453 out of 1586 sen-
tences contained at least one verb sub-categorizing for a
clausal argument. Some randomly selected examples of
intensional constructions are given in (5).
(5) a. When the rods are removed and replaced it is
very easy to hit the glass tab and break it off.
b. The weight of the ejected sets is not sufficient to
keep the exit switch depressed.
c. This is a workaround but also disables the ability
to use the duplex tray after pressing the “Inter-
rupt” button, which should be explained to the
customer.
d. Machines using the defective toner may require
repair or replacement of the Cleaner Assembly.
Nor is intensionality confined to lexical items taking
clausal or predicative arguments, as sentences (3) and (4)
demonstrate. Prevention and causation (of central importance within the Eureka domain) are inherently intensional notions. To say that “A prevented B” is to say that
there was an occurrence of A and no occurrence of B, but
that had A not occurred B would have occurred. Simi-
larly, to say that “A caused B” is to say that there was an
occurrence of both A and B, but that had there been no oc-
currence of A there would have been no occurrence of B.
Both refer to things or events that materialize in one context
but not in another. It is plain that we cannot give a semantic analysis for (6a) along the lines of (6b):
(6) a. Corrosion prevented continuous contact.
b. ∃x,y. corrosion(x) ∧ contact(y) ∧ continuous(y)
∧ prevent(x, y)
c. ∃x,y. corrosion(x) ∧ contact(y) ∧ continuous(y)
∧ prevent(x, y) ∧ exists(x) ∧ ¬exists(y)
since this asserts the existence of the continuous contact
that was prevented. In (Condoravdi et al., 2001) we ar-
gued at some length that preserving a first-order analy-
sis along the lines suggested by (Hirst, 1991) — through
the introduction of explicit existence predicates (6c) —
is at best a partial solution. Not only are identity criteria
for non-existent entities problematic, but (6c) also fails to
capture significant monotonicity entailments: Corrosion
preventing continuous contact does not imply that corro-
sion prevents contact of any form; but first order inference
allows one to drop the continuous(y) conjunct from (6c),
yielding the representation one would expect for corro-
sion prevented contact.
We do not completely rule out the possibility that
some more sophisticated, ontologically promiscuous,
first-order analysis (perhaps along the lines of (Hobbs,
1985)) might account for these kinds of monotonicity in-
ferences. But a more overtly intensional analysis like (7)
does not face this problem in the first place.
(7) ∃x. corrosion(x)
∧ prevent(x, [∃y. contact(y) ∧ continuous(y)])
In (7) we assume that prevent carries a lexical entailment
that its second, propositional, argument is false. Thus (7)
rules out the existence of continuous contact, but does not
rule out the existence of any form of contact. Hirst, how-
ever, points out that allowing quantification over individ-
uals into intensional contexts brings in its wake other well
known difficulties: what does it mean for the same in-
dividual to exist in different possible worlds? In some
sense, this is the trans-world analogue of the problematic
identity criteria for non-existent individuals.
In (Condoravdi et al., 2001) we proposed an alternative
analysis, (8), based on viewing noun phrases as being con-
cept denoting rather than individual denoting (Zimmer-
mann, 1993).
(8) ∃P, Q. subconcept(P, corrosion)
∧ subconcept(Q, continuous ⊓ contact)
∧ prevent(P, Q)
This says that there is some sub-type of corrosion, P, and
some sub-type of continuous contact, Q, such that concept P prevents concept Q. This means, amongst other
things, that there is some instance of P but no instance
of Q. Of course, just because there are no instances of
continuous contact, it does not follow that there are no in-
stances of contact, and (8) predicts the correct monotonic-
ity entailments. Moreover, since concepts are functions
from possible worlds to their extensions (sets of individu-
als), the issue of the trans-world identity of concepts does
not arise: any particular concept expresses a single func-
tion, regardless of possible world.2
2Uniform identity of concepts across possible worlds does
not mean that substitution of concepts that are co-extensive in
one world is always truth preserving. Thus our use of concepts
is intensional in the philosophically traditional sense, which is
a point of clarification requested by one of our anonymous re-
viewers.
3.2 Detecting an Intensional Entailment
In (Condoravdi et al., 2001) we went into greater depth
about how an analysis like (8) formally predicts the right
kinds of entailment. Our purpose here is not to repeat
these arguments, still less to argue that ours is the only
possible way of accounting for these facts. Rather, we
want to show how this highly intensional analysis can be
deployed for practical ECD.
As an example consider determining the possible mu-
tual entailment between (3) and (4), repeated below.
(9) Corrosion caused intermittent electrical contact.
(10) Corrosion prevented continuous electrical contact.
The lexical semantics for cause and prevent can be
stated as follows (where we use the term “context” instead
of “possible world”):
(11) If prevent(C1, C2) is true in context c then
(a) In context c the concept C1 is instantiated and
concept C2 is uninstantiated, and
(b) There is a context c′ that is maximally similar to
c with the exception that C1 is uninstantiated in c′,
and in c′ concept C2 is instantiated.
(12) If cause(C1, C2) is true in context c then
(a) In context c the concept C1 is instantiated and
concept C2 is also instantiated, and
(b) There is a context c′ that is maximally similar to
c with the exception that C1 is uninstantiated in c′,
and in c′ concept C2 is also uninstantiated.
Applying these definitions to (3) and (4), on the assumption that both statements are true in some context c:
(13) If cause(corrosion, intermittent-contact) is true in c
then
(a) In c there is an instance of corrosion and an instance of intermittent contact, and
(b) There is a context c′ that is maximally similar to
c except that there is no instance of corrosion, where
there is no instance of intermittent contact; hence either there is no contact at all, or contact in c′ is non-intermittent (i.e. continuous).
(14) If prevent(corrosion, continuous-contact) is true in
c then
(a) In c there is an instance of corrosion but no instance of continuous contact; hence either there is
no contact in c, or contact is non-continuous (i.e. intermittent).
(b) There is a context c′ that is maximally similar to
c except that there is no instance of corrosion, where
there is an instance of continuous contact.
Both (13) and (14) refer to a relation of maximal similar-
ity between contexts, with respect to the instantiation of a
particular concept. The nature of this relation has delib-
erately not been spelled out, as it is unnecessary to do so
in order to detect the possible entailment relation between
(13) and (14). Assuming that both are evaluated against
the same initial context c, they both invoke counterfactual contexts c′ that are maximally similar to c with respect to the concept of corrosion. Moreover, provided we
pick the right disjunctive alternatives for non-intermittent
and non-continuous contact, we can see that c and c′ have
the same contents in both cases. Thus, whatever maximal
similarity might turn out to be, (3) and (4) can be analysed
as introducing the same contexts related in the same ways:
that is, mutual entailment.3
Before describing how this example can be general-
ized to a scheme for detecting certain classes of inten-
sional entailments and contradictions, we want to empha-
size one point. The example makes free use of the no-
tion of one context/possible world being maximally simi-
lar to another, with respect to the instantiation of a partic-
ular concept. Relations of maximum similarity between
worlds are standard fare within formal, model-theoretic
semantics, and alternative definitions abound. It is prob-
ably fair to say that the notion is not yet well understood.
Fortunately for our example, full understanding of maxi-
mal similarity is not required. We only need to know that
the same relation applies to the same initial context (c) to
pick out the same counterfactual contexts (c′). Of course,
other examples may necessitate spelling out the relation in
more detail. For instance, suppose we had the statement
that rust caused intermittent contact, where rust is a sub-
type of corrosion. This raises the question of how maxi-
mal similarity varies across the type hierarchy; i.e. how
does a maximally similar context with no instance of rust
compare to one with no instance of corrosion? To answer
this, we still do not need to specify fully the maximal sim-
ilarity relation; merely state some of its necessary prop-
erties. Ultimately, though, if we want to use such formal
means to relate language to the world, then relations like
maximal similarity will have to be fully spelled out. But
this is not the task that ECD sets out to deal with.
3.3 A General Approach to Intensional Entailments
The example above points to a general, two stage strat-
egy for ECD. First map texts to contexted clauses, show-
ing what contexts there are, and what (atomic) facts hold
3Note that if we pick the other disjunctive alternative for
non-intermittent contact, i.e. no instance of contact at all, then
(3) can be shown to contradict (4): (3) says that corrosion causes
an intermittent short circuit, while in (4) it intermittently breaks
a contact should that be present. We do not yet have anything
very useful to say about preferences between such interpreta-
tions, though we have been exploring the use of evidential rea-
soning.
in them. Then attempt to pair contexts between the two
text representations, and use relatively limited inference
to determine whether the facts in paired contexts entail or
contradict. We will look at these two stages in turn.
Contexted Clauses A contexted atomic clause com-
prises an atomic fact, plus the context in which the fact
is supposed to hold. Borrowing McCarthy’s notation4 we
write (ist ⟨context⟩ ⟨fact⟩) to state that some fact holds
in some context. A list of flat contexted clauses is inter-
preted conjunctively. Consider the contexted clauses derived from (8):
(15) a. Corrosion prevented continuous contact.
b. (ist t (instantiated corrosion1))
(ist t (uninstantiated contact2))
(ist t (prevent corrosion1 contact2))
(maxsim prevent-context3 t corrosion1)
(ist prevent-context3 (uninstantiated contact2))
(ist prevent-context3 (instantiated corrosion1))
(subconcept corrosion1 corrosion)
(subconcept contact2 contact)
(subconcept contact2 continuous)
Here we have a number of facts about what holds in the
initial context, t: that there is an instance of a sub-concept
of corrosion but no instance of some sub-concept of (con-
tinuous) contact, and that the prevent relation holds be-
tween the corrosion and contact concepts. This relation
also introduces a new context, prevent-context3, which is
maximally similar to t with respect to corrosion. Within
prevent-context3, alternative, counterfactual assertions
about concept instantiations are made. Finally, and inde-
pendently of any particular context, subconcept assertions
are made. The first says that corrosion1 is some (unspeci-
fied) subconcept of the concept corrosion. This statement
is not relativized to a context, since the concept hierarchy
is assumed to be constant across all contexts (even though
extensions of concepts can vary).
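For concreteness, a clause list like (15b) can be held as a flat list of tuples read conjunctively. The fragment below is an illustrative encoding of our own; the tuple format and helper names are assumptions, not the system's actual data structures.

```python
# A contexted clausal representation mirroring (15b). Each clause is
# a flat tuple; a list of clauses is interpreted conjunctively.
# This encoding is illustrative, not the described system's format.

clauses = [
    ("ist", "t", ("instantiated", "corrosion1")),
    ("ist", "t", ("uninstantiated", "contact2")),
    ("ist", "t", ("prevent", "corrosion1", "contact2")),
    ("maxsim", "prevent-context3", "t", "corrosion1"),
    ("ist", "prevent-context3", ("uninstantiated", "contact2")),
    ("ist", "prevent-context3", ("instantiated", "corrosion1")),
    # subconcept assertions hold independently of any context,
    # since the concept hierarchy is constant across contexts:
    ("subconcept", "corrosion1", "corrosion"),
    ("subconcept", "contact2", "contact"),
    ("subconcept", "contact2", "continuous"),
]

def facts_in(context, clauses):
    """Collect the atomic facts asserted to hold in a given context."""
    return [c[2] for c in clauses if c[0] == "ist" and c[1] == context]

def contexts(clauses):
    """All context names mentioned in ist/maxsim clauses."""
    names = set()
    for c in clauses:
        if c[0] == "ist":
            names.add(c[1])
        elif c[0] == "maxsim":
            names.update(c[1:3])
    return names
```

For example, `facts_in("t", clauses)` yields the three facts of the initial context, while `contexts(clauses)` yields the two contexts t and prevent-context3.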
The ‘flattening’ of (8) to derive (15b) proceeds via
skolemization, conversion to clausal form, the relativization of each conjunct to a context, and canonicalization to
introduce extra contextual structure that is only implicit
in linguistic forms (here, the context prevent-context3,
corresponding to the counterfactual state of affairs that
the lexical entailments of prevent make reference to) or
in domain knowledge.
The canonicalization process is both language- and
knowledge/ontology-driven, introducing a deeper level of
semantic representation. Structures assembled by com-
positional semantics must thus be transformed to struc-
tures that are well-suited for making successive small,
4Though not borrowing McCarthy’s view of contexts as
subsumption-ordered logical micro-theories.
automated inference steps. Performing comparison on
canonicalized contexted representations reflects a compu-
tationally advantageous division of labor: highly directed
use of world knowledge and inference in the service of
creating meaning representations, followed by relatively
lightweight inference procedures in the stage of determin-
ing inferential relations between texts. Further aspects of
canonicalization to conceptual structure based on a lin-
guistically independent knowledge representation are dis-
cussed in (Crouch et al., 2002), e.g. mapping word senses
onto terms in a domain-appropriate concept hierarchy.
A more complex example of flattening and canonicalization is (16), which is ambiguous between its being the
removal of the sleeve that prevents breakage and its being the making of the cable flexible that prevents breakage. The initial logical form for the second interpretation
is shown in (17), and the packed contexted representation
for both parses is (partially) shown in (18).
(16) Removing a sleeve made the cable flexible, prevent-
ing breakage.
(17) ∃C. subconcept(C, cable) & def(C, ?A) &
prevent(make([∃s. sleeve(s) & ∃a. agent-pro(a)
& remove(a, s)],
flexible(C)),
∃breakage)
(18) (ist t (uninstantiated breakage-type2))
(ist prevent-ctx5 (instantiated breakage-type2))
(ist make-ctx3 (make remove-ev4 flexible-ctx5))
(ist make-ctx3 (sleeve sleeve6(make-ctx3)))
(ist make-ctx3
(remove remove-ev4 agent7(make-ctx3)
sleeve6(make-ctx3)))
(ist flexible-ctx5 (flexible cable1))
(subcontext make-ctx3 t)
(subcontext flexible-ctx5 make-ctx3)
(subconcept cable1 cable)
(concepteq cable1 part-12KE45)
(subconcept breakage-type2 breakage)
(parse 1
(ist t (prevent make-ctx3 breakage-type2)))
(parse 2
(ist t (prevent remove-ev4 breakage-type2)))
(parse 1
(maxsim prevent-ctx5 t make-ctx3))
(parse 2
(maxsim prevent-ctx5 t remove-ev4))
Amongst other things, note how the proposition argu-
ment make(...) is replaced by the new context name
make-ctx3, and component clauses asserted within this
new context. Note also how skolem functions like
sleeve6(make-ctx3) take context terms as arguments, and
how the hook for definite reference by “the cable”, def(C,
?A), is canonicalized to a concept equality, where part-
12KE45 is some recently mentioned machine part. Also,
maxsim can be relativized either to an event type, remove-
ev4, or a context, make-ctx3.
Context Matching Having obtained contexted repre-
sentations for two texts, ECD proceeds in two stages.
First, by assuming that both texts describe the same ini-
tial context, locate sub-contexts introduced by the two
texts that have parallel relations to the initial context. Sec-
ond, for the contexts thus paired identify local entailments
and contradictions using first-order reasoning. Given our
use of concepts, much of this can be done using T-box
reasoning from description logics. At present, we only
view identical context relations as parallel, and do not
give much consideration to the inheritance of proposi-
tional content between related contexts. A deeper level
of matching would be based on an algebra of contexts de-
tailing different types of context relations and their inher-
itance properties.
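Under these simplifying assumptions the matching stage itself is lightweight. The sketch below is our own illustration (the clause format and helper names are assumptions, and concept alignment across the two texts is taken as given): it pairs contexts standing in identical maxsim relations to the shared initial context t, then flags instantiated/uninstantiated clashes within paired contexts.

```python
# Sketch of two-stage context matching over contexted clauses, under
# the assumption that only identical context relations count as
# parallel. Clause tuples and names are illustrative.

def pair_contexts(a, b):
    """Stage 1: pair contexts standing in the same maxsim relation
    to the shared initial context t."""
    rel_a = {(c[2], c[3]): c[1] for c in a if c[0] == "maxsim"}
    rel_b = {(c[2], c[3]): c[1] for c in b if c[0] == "maxsim"}
    return [("t", "t")] + [(rel_a[r], rel_b[r]) for r in rel_a if r in rel_b]

def local_contradictions(a, b):
    """Stage 2: within paired contexts, flag any concept that one
    text instantiates and the other uninstantiates."""
    flip = {"instantiated": "uninstantiated",
            "uninstantiated": "instantiated"}
    hits = []
    for ctx_a, ctx_b in pair_contexts(a, b):
        facts_b = {c[2] for c in b if c[0] == "ist" and c[1] == ctx_b}
        for c in a:
            if c[0] == "ist" and c[1] == ctx_a and c[2][0] in flip:
                status, concept = c[2]
                if (flip[status], concept) in facts_b:
                    hits.append((ctx_a, ctx_b, concept))
    return hits

# Toy example: text A denies an instance of contact2 in t, text B
# asserts one -- a contradiction in the paired initial context.
text_a = [("ist", "t", ("instantiated", "corrosion1")),
          ("ist", "t", ("uninstantiated", "contact2"))]
text_b = [("ist", "t", ("instantiated", "corrosion1")),
          ("ist", "t", ("instantiated", "contact2"))]
```

In a fuller treatment the stage-2 check would be replaced by T-box subsumption reasoning over the aligned concepts, rather than literal fact comparison.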
4 Feasibility of ECD
The last section described one way of approaching inten-
sionality in the setting of entailment and contradiction de-
tection. Our intention has not been to claim that this is
the “one true way” of dealing with intensional ECD. It
is rather to demonstrate the claim that practical progress
can be made in the area, and that formal model-theoretic
semantics can make a contribution to this. However, the
preceding discussion has arguably been at too abstract and
theoretical a level to really demonstrate a claim of prac-
tical progress or feasibility. This section briefly discusses
a prototype entailment and contradiction detection system
(described at greater length in (Crouch et al., 2002)) in or-
der to point out that current technology already makes it
feasible to begin addressing ECD.
The system has been developed around the Eureka col-
lection of printer and copier repair tips. The full collection
contains 30–40,000 free text documents. We have been
focusing on a development subset of some 1,300 of these
documents, including 15 pairs that have been pulled out
for closer scrutiny because of known entailments and con-
tradictions between them. We do not as yet have any test-
ing data separate from our development data.
The system maps each document into a set of contexted
clauses, by means of full syntactic and semantic analy-
sis followed by knowledge-based canonicalization. Doc-
ument representations undergo (statistically filtered) pair-
wise comparison to identify sentences within document
pairs related by contradiction or entailment. We will de-
scribe the mapping and the comparison in turn.
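The overall control flow can be rendered schematically as below, with trivially stubbed stages standing in for the real analysis components (XLE parsing, glue semantics, knowledge-based canonicalization) and for the PLSA pre-filter; the stub behaviour is invented for illustration.

```python
from itertools import combinations

def analyse(doc):
    """Stub for parsing + semantics + canonicalization: here, just
    one contexted clause per word of the document."""
    return {("t", word) for word in doc.split()}

def likely_overlap(rep1, rep2):
    """Stub for the statistical pre-filter (PLSA in the real system)."""
    return bool(rep1 & rep2)

def compare_collection(docs):
    """Analyse each document once, then keep only the document
    pairs that survive the pairwise pre-filter."""
    reps = {name: analyse(text) for name, text in docs.items()}
    return [(a, b) for a, b in combinations(sorted(reps), 2)
            if likely_overlap(reps[a], reps[b])]
```

The point of the sketch is only the shape of the computation: per-document analysis is done once, and the quadratic comparison step is cut down by the statistical filter before any deeper matching is attempted.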
The first stage of mapping uses a broad-coverage, hand-
coded Lexical Functional Grammar of English (Butt et
al., 1998) and the parser from the Xerox Linguistic En-
vironment (XLE) (Maxwell and Kaplan, 1993) to parse
the documents. Parsing is robust in the sense that ev-
ery sentence receives a functional-structure analysis, en-
coding grammaticalized predicate-argument structure. In
about 25% of cases the functional-structures are fragmen-
tary, either because of coverage gaps in the grammar or
because of poor spelling and punctuation (to which the
technicians writing the tips are prone). Fragments
comprise the longest-span structures for constituents
such as S,
NP or PP that have been successfully analysed by the
grammar. Ambiguity management via packing (Maxwell
and Kaplan, 1989) allows the parser to efficiently [5] find
all possible analyses of each sentence according to the
grammar, and represent the alternatives in a compact,
structure-shared form. Evaluation of essentially the same
grammar on a dependency-annotated subset of section 23
of the UPenn Wall Street Journal gives the accuracy of
best parses as 85%, increasing by another 4% for non-
fragmentary analyses (Riezler et al., 2002). Stochastic se-
lection of the most probable parse (not necessarily the best
parse) gives an accuracy of 80%.
Initial semantic interpretation is via an implementation
of “glue semantics”, which uses linear logic deduction to
assemble the meanings of words and phrases in a syn-
tactically analysed sentence (Dalrymple, 1999). Seman-
tic interpretation preserves the ambiguity packing in syn-
tactic analysis (though currently not in an algorithmically
optimal way), deals with such things as quantifier scop-
ing, and incorporates lexico-semantic information not
relevant to parsing. Despite theoretical proposals for
dealing with anaphora and ellipsis in glue interpretation,
e.g. (Crouch, 1999), these have not yet been implemented;
hooks are placed in the representation to mark where sub-
sequent canonicalization needs to resolve textual and do-
main dependencies like pronouns and compound noun in-
terpretations. Semantic analysis is also robust, with about
65% of all sentences receiving full, non-fragmentary
analyses (around 60% on WSJ-23).
Canonicalization starts with a systematic flattening of
logical forms: skolemizing quantifiers, replacing inten-
sional arguments by new context names, and expanding
out the intensional arguments within their new contexts.
Rewrite rules are then applied, with the assistance of a
TMS-based evidential reasoner, to further refine the re-
sulting contexted clauses. Some rules are domain-
independent simplifications, mapping alternate linguistic
constructions onto the same underlying form. Others
exploit ontological information to map words onto
appropriate word
senses or to identify domain-appropriate pronoun
antecedents. Others introduce additional contextual
structure, or eliminate irrelevant linguistically induced
contexts. To promote domain portability, care is being
taken to write canonicalization rules in such a way as to
distinguish between (a) domain-independent rules, (b)
general rules with an interface to domain-dependent
ontologies, and (c) domain-specific hacks.
[5] It takes a morning to distribute the 1,300 development
documents across half a dozen workstations and perform
syntactic and semantic analysis.
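The replacement of intensional arguments by new context names can be sketched as follows, under an assumed encoding of logical forms as nested tuples of a predicate and its arguments (the "prevent" example and the encoding are invented for illustration; skolemization of quantified variables is omitted).

```python
import itertools

_fresh = itertools.count(1)

def flatten(form, ctx="t"):
    """Flatten a nested logical form into a list of contexted clauses
    (context, predicate, args). Tuple-valued arguments are treated as
    intensional: each is replaced by a fresh context name and its
    content is expanded out within that new context."""
    clauses = []
    pred, *args = form
    flat_args = []
    for a in args:
        if isinstance(a, tuple):
            new_ctx = "ctx%d" % next(_fresh)
            clauses.extend(flatten(a, new_ctx))
            flat_args.append(new_ctx)
        else:
            flat_args.append(a)
    clauses.append((ctx, pred, tuple(flat_args)))
    return clauses
```

For instance, flattening ("prevent", "x", ("rise", "paper")) yields a clause asserting the rising within a fresh context ctx1, plus a top-context clause in which "prevent" takes ctx1 as its argument; rewrite rules then operate over these flat clauses.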
Comparison of representations starts with statistical
pre-filtering. This uses probabilistic latent semantic
analysis to identify, on the basis of word occurrences,
which documents are likely to have some content overlap
(Brants and Stolle, 2002). For candidate pairs of docu-
ments thus identified, we employ a charitable form of ref-
erence resolution: if clauses or contexts occurring in
different documents can be identified with one another,
their identity is assumed. The Structure Mapping Engine
(SME) (For-
bus et al., 1989) is used to match contexts. The SME is a
graph matching algorithm developed for the recognition
of analogy. In our case it is used to match up structurally
similar context structures containing structurally similar
clauses. Having paired the contextual structures, limited
ontological inference is then used to detect contradictions
or entailments between the contents of matched contexts.
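The matching step itself is performed by SME; purely as an illustrative stand-in, a much cruder version of the idea can be sketched as follows, with each context reduced to the set of predicate names it contains and structural similarity approximated by set overlap (all data here is invented).

```python
def similarity(preds_a, preds_b):
    """Jaccard overlap of the predicates occurring in two contexts."""
    if not preds_a and not preds_b:
        return 1.0
    return len(preds_a & preds_b) / len(preds_a | preds_b)

def match_contexts(doc1, doc2):
    """Greedily pair each context of doc1 with its most similar
    context in doc2 (SME instead performs genuine structural
    graph matching over the clauses within contexts)."""
    return {name1: max(doc2, key=lambda name2: similarity(preds1, doc2[name2]))
            for name1, preds1 in doc1.items()}
```

Once contexts are paired in this fashion, the limited ontological inference described above is applied to the clauses of each matched pair.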
In summary, the robust application of detailed, hand-
coded rules to the syntactic and semantic analysis of open
texts appears feasible, with syntax somewhat more ad-
vanced. Similar observations have been made by other
researchers, e.g. (Siegel and Bender, 2002). Knowledge-
based canonicalization is less well advanced. In part,
progress depends on the construction of rules in many
ways similar to the grammar rules and lexical entries of
syntactic analysis. Progress also depends on the construc-
tion of appropriate ontologies.
5 Conclusion
We have argued that entailment and contradiction detec-
tion (ECD) should be included as one of a number of met-
rics for evaluating text understanding. Intensional con-
structions — predications with proposition- or property-
denoting arguments — are a challenge for ECD. They oc-
cur commonly, but simple predicate-argument represen-
tations do not do justice to the variety of inferences they
support. More sophisticated first-order accounts (Hirst,
1991; Hobbs, 1985) may be extendable to bear this load.
But there is also a direct path building on results from
possible-worlds semantics. We are developing contexted
clausal representations that aim at a useful trade-off be-
tween tractability and expressivity. Other researchers are
also building on insights from model-theoretic semantics
in interesting ways, e.g. (Schubert and Hwang, 2000).
Intensional ECD seems to presuppose deep and detailed
syntactic and semantic analysis (though we have no ar-
guments to rule out the possibility of shallower analysis).
The current state of deep language processing technology
suggests that ECD is a viable though challenging metric
for open text in restricted domains.
One issue that we have not addressed is the best form
for annotated evaluation material for ECD. Ideally, this
should be raw texts, annotated only to link the sentences
or clauses that have entailment or contradiction relations
between them. This has the benefit of being an almost
entirely theory-neutral annotation scheme. A mark-up
based around some form of semantic representation for
texts (e.g. contexted clauses) would very likely impose an
unfair penalty on alternative approaches. A limited pre-
cursor to raw-text mark-up for semantic evaluation was
undertaken as part of the FraCaS project (Cooper and Col-
leagues, 1996). This was a semantic test suite of about
350 syllogisms, specifying entailment and contradiction
relations, or the lack of them, e.g.
(19) The PC-6082 is faster than the ITEL-XZ.
The ITEL-XZ is fast.
Is the PC-6082 fast? [Yes]
Even for trivial, artificial examples like these, two
problems arose. (i) The premises or conclusions can be
ambiguous, where entailments or contradictions follow
under one set of interpretations but not under another.
There
is no obvious way of marking the intended interpretations.
(ii) It is extraordinarily hard to construct examples where
inference relations do not in part depend on world knowl-
edge. By taking texts rather than sentences as the units of
annotation, the intended interpretation is generally much
clearer (to human annotators). With regard to domain de-
pendence, one just has to accept that ECD quality will de-
cline without world knowledge.

References
Thorsten Brants and Reinhard Stolle. 2002. Finding sim-
ilar documents in document collections. In Using Se-
mantics for Information Retrieval and Filtering: State
of the Art and Future Research. Workshop at LREC-
2002, Las Palmas, Spain.
M. Butt, Tracy King, M. Niño, and F. Segond. 1998.
A Grammar Writer’s Cookbook. CSLI Lecture Notes.
CSLI, Stanford.
Cleo Condoravdi, Richard Crouch, Martin van den Berg,
John O. Everett, Valeria de Paiva, Reinhard Stolle,
and Daniel G. Bobrow. 2001. Preventing existence.
In C. Welty and B. Smith, editors, Formal Ontology
in Information Systems: Proc. FOIS-2001, Ogunquit,
Maine, pages 162–173. ACM Press, New York.
Robin Cooper and Colleagues. 1996. FraCaS Deliverable
D16: Using the framework. Technical Report
LRE 62-051 D16, HCRC, University of Edinburgh.
www.cogsci.ed.ac.uk/fracas.
Richard Crouch, Cleo Condoravdi, Reinhard Stolle,
Tracy King, Valeria de Paiva, John O. Everett, and
Daniel G. Bobrow. 2002. Scalability of redundancy
detection in focused document collections. In Pro-
ceedings First International Workshop on Scalable
Natural Language Understanding (SCANALU-2002),
Heidelberg, Germany.
Richard Crouch. 1999. Ellipsis and glue languages.
In Shalom Lappin and Elabbas Benmamoun, editors,
Fragments: Studies in Ellipsis and Gapping. Oxford
University Press, Oxford.
Mary Dalrymple, editor. 1999. Semantics and Syntax in
Lexical Functional Grammar: The Resource Logic Ap-
proach. MIT Press, Cambridge, MA.
Kenneth D. Forbus, Brian Falkenhainer, and Dedre Gen-
tner. 1989. The structure mapping engine: Algorithm
and examples. Artificial Intelligence, 41(1):1–63.
Graeme Hirst. 1991. Existence assumptions in knowl-
edge representation. Artificial Intelligence, 49:199–
242, May.
Jerry R. Hobbs. 1985. Ontological promiscuity. In Proc.
ACL-1985, pages 61–69, Chicago, IL.
John T. Maxwell and Ronald M. Kaplan. 1989. An
overview of disjunctive constraint satisfaction. In
Proceedings of the International Workshop on Parsing
Technologies, pages 18–27.
John Maxwell and Ronald M. Kaplan. 1993. The inter-
face between phrasal and functional constraints. Com-
putational Linguistics, 19:571–589.
Christof Monz and Maarten de Rijke. 2001. Lightweight
inference for computational semantics. In Proceedings
of 3rd International Conference on Inference in Com-
putational Semantics, pages 59–72, Siena, Italy.
Judea Pearl. 1991. Probabilistic Reasoning in Intelli-
gent Systems: Networks of Plausible Inference. Mor-
gan Kaufmann.
Stefan Riezler, Ronald Kaplan, Tracy King, Mark John-
son, Richard Crouch, and John Maxwell. 2002. Pars-
ing the Wall Street Journal using a lexical functional
grammar and discriminative estimation techniques. In
Proc. ACL-2002. To appear.
L. K. Schubert and C. H. Hwang. 2000. Episodic logic
meets Little Red Riding Hood: A comprehensive, natu-
ral representation for language understanding. In L. M.
Iwanska and S. C. Shapiro, editors, Natural Language
Processing and Knowledge Representation. MIT Press.
Melanie Siegel and Emily Bender. 2002. Efficient
deep processing of Japanese. In Proc 3rd Workshop
on Asian Language Resources and International Stan-
dardization, Taipei, Taiwan.
Ede Zimmermann. 1993. On the proper treatment of
opacity in certain verbs. Natural Language Semantics,
1:149–179.
