Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 49–54,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Generating an Entailment Corpus from News Headlines 
 
 
John Burger, Lisa Ferro
The MITRE Corporation
202 Burlington Rd.
Bedford, MA 01730, USA
{john,lferro}@mitre.org
 
 
 
 
Abstract 
We describe our efforts to generate a 
large (100,000 instance) corpus of textual
entailment pairs from the lead paragraph 
and headline of news articles. We manu-
ally inspected a small set of news stories 
in order to locate the most productive 
source of entailments, then built an anno-
tation interface for rapid manual evalua-
tion of further exemplars. With this 
training data we built an SVM-based 
document classifier, which we used for 
corpus refinement purposes—we believe 
that roughly three-quarters of the resulting 
corpus are genuine entailment pairs.  We 
also discuss the difficulties inherent in
manual entailment judgment, and suggest
ways to ameliorate some of these. 
1 Introduction
MITRE has a long-standing interest in robust text 
understanding, and, like many, we believe that 
adequate progress in such an endeavor requires a 
well-designed evaluation methodology.  We have 
explored in great depth the use of human reading 
comprehension exams for this purpose (Hirschman 
et al., 1999; Wellner et al., 2005) as well as
TREC-style question answering (Burger, 2004).
In this context, the recent Pascal RTE evaluation 
(Recognizing Textual Entailment, Dagan et al., 
2005) captured our interest. The goal of RTE is to
assess systems’ abilities at judging semantic en-
tailment with respect to a pair of sentences, e.g.: 
• Fred spilled wine on the carpet.
• The rug was wet. 
In RTE parlance, the antecedent sentence is 
known as the text, while the consequent sentence is 
known as the hypothesis.  Simply put, the chal-
lenge for an RTE system is to judge whether the 
text entails the hypothesis.  Judgments are Boo-
lean, and the primary evaluation metric is simple 
accuracy, although there were other, secondary 
metrics used in the evaluation. 
The RTE organizers provided 567 exemplar sen-
tence pairs.  This is adequate for system develop-
ment, but not for the application of large-scale 
statistical models. In particular, we wished to cast 
the problem as one of statistical alignment as used 
in machine translation.  MT systems typically use 
millions of sentence pairs, and so we decided to
find or generate a much larger corpus. This paper 
describes our efforts along these lines, as well as 
some observations about the problems of annotat-
ing entailment data.  In Section 2 we describe our 
initial search for an entailment corpus.  Section 3 
briefly describes an annotation interface we de-
vised, as well as our efforts to refine our corpus. 
Section 4 explains many of the issues and problems
inherent in manual annotation of entailment
data. 
2 Finding Entailment Data
In our study of the Pascal RTE development corpus,
we found that a considerable majority of the
TRUE pairs exhibit a stronger relationship than
entailment; namely, the hypothesis is a paraphrase
of a subset of the text. For instance, given the text
John murdered Bill yesterday, the hypothesis Bill
is dead is an entailment, while the hypothesis Bill
was killed by John exhibits the stronger partial
paraphrase relationship to the text. We found that
94% (131/140) of the TRUE pairs in the Pascal
RTE dev2 corpus were these sorts of paraphrases.
In our search for an entailment corpus, we ob-
served that the headline of a news article is often a 
partial paraphrase of the lead paragraph, much like 
the RTE data, or is sometimes a genuine entail-
ment.  We thus deduced that headlines and their 
corresponding lead paragraphs might provide a 
readily available source of training data.  As an 
initial test of this hypothesis, we manually
inspected over 200 news stories from 11 different
sources. We found a great deal of variety in
headline formats, and ultimately found the Xinhua
News Agency English Service articles from the
Gigaword corpus (Graff, 2003) to be the richest
source, though somewhat limited in subject
domain.  We describe here our data collection and
analysis process. 
Because our goal was to automatically generate 
an extremely large corpus of exemplars, we fo-
cused on large data sources.  We first examined 
111 news stories culled from MiTAP (Damianos et
al., 2003), which collects over one million articles
per month from approximately 75 different
sources.  By first counting the number of articles
typically collected for each source, we selected a
mixture of sources that each had more than 10,000
articles for our sample period of one and a half
months.  As discussed further below, part way
through our investigation it became clear that we 
needed to include more native English sources, so 
the Christian Science Monitor articles were added, 
though they fell below our arbitrary 10K mark. 
Figure 1 summarizes the MiTAP news sources ex-
amined. 
For each lead paragraph/headline pair, a human 
rendered a judgment of yes, no, or maybe as to 
whether the lead paragraph entailed the headline, 
where maybe meant that the headline was very 
close to being an entailment or paraphrase. This is 
likely equivalent to the notion of “more or less
semantically equivalent” used in the Microsoft
Research Paraphrase Corpus (Dolan et al., 2005).
We included maybe because we expected that many
of the near-miss pairs would make adequate
training data for statistical algorithms, in
spite of being less than perfect.
There were many types of news articles in the 
MiTAP data that did not yield good headline/lead
paragraph pairs for our purposes. Many would be
difficult to filter out using automated heuristics.
Two frequent examples of this were opinion-editorial
pieces and daily Wall Street summaries.
Others would be more amenable to automatic
elimination, including obituaries and collections of
news snippets like the Washington Post’s “World
in Brief”.  Articles consisting of personal
narratives never yielded good headlines, but these could
easily be eliminated by recognizing first person 
pronouns in the lead paragraph.  Figure 2 shows 
the judgments for all the MiTAP articles examined, 
where the Filtered row excludes these easily elimi-
nated article types. 
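The first-person pronoun heuristic mentioned above could be sketched roughly as follows. This is only an illustration: the pronoun list, the function name is_personal_narrative, and the handling of all-caps tokens are our assumptions, not a description of the filter actually used.

```python
import re

# Hypothetical pronoun list; the paper does not say which pronouns were considered.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def is_personal_narrative(lead_paragraph):
    """Return True if the lead paragraph contains a first-person pronoun."""
    for token in re.findall(r"[A-Za-z]+", lead_paragraph):
        # Skip all-caps tokens such as "US" so the country abbreviation
        # is not mistaken for the pronoun "us"; single-letter "I" is kept.
        if len(token) > 1 and token.isupper():
            continue
        if token.lower() in FIRST_PERSON:
            return True
    return False
```

A heuristic this crude would also reject leads that merely quote someone speaking in the first person, so it trades some recall for simplicity.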
As Figure 2 shows, the MiTAP data did not 
yield a high percentage of good pairs. In addition,
whether due to poor machine translation or English
dialectal differences, our evaluator found it
difficult to understand some of the text from sources
that were not English-primary.  A certain amount
of ill-formed text was acceptable, since the Pascal
RTE challenge included training and test data
drawn from MT scenarios, but we did not wish our
data to be too dominated by such sources.  Thus,
we selected additional native-English articles to
add to our sample set. 
Source                  No. articles   No. articles
                        examined       in 1.5 mos.
miami-herald (US)       19             94,278
washington-post (US)    18             13,813
cs-monitor (US)         11              7,102
all-africa              18             68,521
dawn (Pakistan)         17             46,839
gulf-daily-news         10             26,837
national-post (Canada)  18             14,124
Figure 1: MiTAP News Sources Examined

            Yes        No         Maybe      Total
All Pairs   54 (49%)   39 (35%)   18 (16%)   111
Filtered    54 (53%)   33 (33%)   14 (14%)   101
Figure 2: MiTAP Corpus Results

Source   Yes        No         Maybe      Total
APW       8 (31%)   12 (46%)    6 (23%)    26
AFE      14 (56%)    4 (16%)    7 (28%)    25
NYT       8 (31%)   17 (65%)    1 (4%)     26
XIE      22 (85%)    4 (15%)    0 (0%)     26
Total    52 (50%)   37 (36%)   14 (14%)   103
Figure 3: Gigaword Corpus Results

Despite the overall poor yield from this data, it
was apparent that some news sources tended to be
more fruitful than others.  For example, 13 out of
18 of the Washington Post articles yielded good
pairs, as opposed to only 1 of the 11 Christian
Science Monitor articles.
This generalization was likewise true in the sec-
ond corpus we examined, the Gigaword newswire 
corpus (Graff, 2003).  Gigaword contains over 4
million documents from four news sources:
• Agence France Press English Service (AFE) 
• Associated Press Worldstream English Service
(APW) 
• The New York Times Newswire Service 
(NYT) 
• The Xinhua News Agency English Service 
(XIE) 
For each source, Gigaword articles are classified 
into several types, including newswire advisories, 
etc.  We restricted our investigations to actual 
news stories.  As Figure 3 shows, overall results 
were much the same as the MiTAP articles, but 
85% of the XIE articles yielded adequate pairs. 
Based on these preliminary results we decided to 
focus further manual investigations on the XIE 
articles from Gigaword.  We also decided to expend
some effort on an annotation tool that would
allow us to proceed more quickly than the early 
annotation experiments described above. 
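As a rough illustration of how headline and lead-paragraph pairs might be pulled from Gigaword-style markup while keeping only actual news stories, consider the sketch below. The tag names (DOC, HEADLINE, TEXT, P), the type value "story", the sample document, and its identifiers are assumptions made for the example; the corpus documentation should be consulted for the real zoning scheme.

```python
import re

# Assumed markup: <DOC id="..." type="..."> wrapping HEADLINE and TEXT/P zones.
DOC_RE = re.compile(
    r'<DOC\s+id="(?P<id>[^"]+)"\s+type="(?P<type>[^"]+)"\s*>(?P<body>.*?)</DOC>',
    re.S)
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.S)
FIRST_P_RE = re.compile(r"<TEXT>.*?<P>(.*?)</P>", re.S)

def headline_lead_pairs(sgml, wanted_type="story"):
    """Return (doc_id, lead_paragraph, headline) for documents of the wanted type."""
    pairs = []
    for doc in DOC_RE.finditer(sgml):
        if doc.group("type") != wanted_type:
            continue  # skip advisories and other non-story documents
        headline = HEADLINE_RE.search(doc.group("body"))
        lead = FIRST_P_RE.search(doc.group("body"))
        if headline and lead:
            pairs.append((doc.group("id"),
                          " ".join(lead.group(1).split()),      # normalize whitespace
                          " ".join(headline.group(1).split())))
    return pairs

# Hypothetical sample input, not taken from the corpus.
SAMPLE = """
<DOC id="XIE-EXAMPLE-0001" type="story" >
<HEADLINE>
Shanghai's Hongqiao Airport Operates Smoothly
</HEADLINE>
<TEXT>
<P>
As of Saturday, Shanghai's Hongqiao Airport has set
a record for safe operation.
</P>
<P>
A second paragraph that is ignored.
</P>
</TEXT>
</DOC>
<DOC id="XIE-EXAMPLE-0002" type="advis" >
<HEADLINE>
News Advisory
</HEADLINE>
<TEXT>
<P>
Advisory text.
</P>
</TEXT>
</DOC>
"""

pairs = headline_lead_pairs(SAMPLE)
```

In RTE terms, the lead paragraph of each returned pair plays the role of the text and the headline plays the role of the hypothesis.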
3 Refining the Data
MITRE has developed a series of annotation tools
for a variety of linguistic phenomena (Day et al.,
1997; Day et al., 2004), but these are primarily
designed for fine-grained tasks such as named entity
and syntactic annotation. For our headline corpus,
we wanted the ability to rapidly annotate at a
document level from a small set of categories. Further,
we wanted the interface to easily support
distributed annotation efforts.
The resulting annotation interface is shown in
Figure 4.  It is web-based, and annotations and
other document information are stored in an SQL
database.  The document to be evaluated is
displayed in the user’s chosen browser, with the XML
document zoning tags visible so that the user can
easily identify the headline and lead paragraph. At
the top of the document are three buttons from
which to select a yes/no/maybe judgment.  The
user can also add a comment before moving to the
next document. Typically several documents can
be judged per minute. The client-server
architecture supports multiple annotations of the same
document by different annotators; accordingly, it
has a mode enabling reconciliation of
inter-annotator disagreements.  All further annotation
efforts discussed below were carried out with this
tool.
Figure 4: Entailment Tagging Interface
Using the tool, we tagged approximately 900
randomly chosen Gigaword documents, including 
520 XIE documents.  From this, we estimate that 
70% of the XIE headlines in Gigaword are entailed 
by the corresponding lead paragraph.  (This is 
lower than the rough estimate described in Section 
2, but that was based on a very small sample.)  We 
decided to explore ways to refine the data in order 
to arrive at a smaller, but less noisy subcorpus. We 
observed that different subgenres within the news-
paper corpus evinced the lead-entails-headline 
quality to different degrees.  For example, articles 
about sports or entertainment often had whimsical 
(non-entailed) headlines, while articles about poli-
tics or business more frequently had the headline 
quality we sought. 
Accordingly, we decided to treat the data re-
finement process as a text classification problem, 
one of finding the mix of genres or topics that 
would most likely possess the lead-entails-headline
quality. We used SVM-light (Joachims, 2002) as a
document classifier, training it on the initial set of 
annotated articles. (Note that these text classifica-
tion experiments made use of the entire article, not 
just the lead and headline.)  We experimented with 
a variety of feature representations and SVM pa-
rameters, but found the best performance with a 
Boolean bag-of-words representation, and a simple 
linear kernel.  Leave-one-out estimates indicate 
that SVM-light could identify documents with the 
requisite entailment quality with 77% accuracy.
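The representation and evaluation protocol just described can be illustrated with a small self-contained sketch. Note that the classifier below is a plain perceptron, used purely as a stand-in linear model; it is not SVM-light, and the function names and toy data are hypothetical.

```python
def boolean_bow(text):
    """Boolean bag of words: the set of lowercased tokens (presence only, no counts)."""
    return frozenset(text.lower().split())

def train_perceptron(examples, epochs=10):
    """Stand-in linear classifier (NOT SVM-light): a perceptron over Boolean features."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in examples:          # label is +1 or -1
            score = bias + sum(weights.get(f, 0.0) for f in feats)
            if label * score <= 0:             # misclassified: move toward the label
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + label
                bias += label
    return weights, bias

def predict(model, feats):
    """Return +1 or -1 from the sign of the linear score."""
    weights, bias = model
    return 1 if bias + sum(weights.get(f, 0.0) for f in feats) > 0 else -1

def leave_one_out_accuracy(examples):
    """Train on all examples but one, test on the held-out one, and average."""
    correct = 0
    for i, (feats, label) in enumerate(examples):
        model = train_perceptron(examples[:i] + examples[i + 1:])
        correct += predict(model, feats) == label
    return correct / len(examples)

# Toy data: +1 means the lead entails the headline, -1 means it does not.
toy = [(boolean_bow("trade talks open"), 1),
       (boolean_bow("trade deal signed"), 1),
       (boolean_bow("team wins match"), -1),
       (boolean_bow("team loses match"), -1)]
loo = leave_one_out_accuracy(toy)
```

Leave-one-out estimation retrains the model once per annotated document, which is affordable here because the annotated set is small.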
We performed one round of active learning 
(Tong & Koller, 2000), in which we used
SVM-light to classify a large subset of the unannotated
corpus, and then selected a 10-document subset 
about which the classifier was least certain.  The 
rationale is that annotating these uncertain documents
will be more informative to further learning
runs than a randomly selected subset.  In the case 
of large-margin classifiers like SVMs, the natural 
choice is to select the instances closest to the mar-
gin. These were then annotated, and added back to 
the training data for the next learning run.  How-
ever, leave-one-out estimates indicated that the 
classifier benefited little from these new instances.
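The selection step of such an active learning round can be sketched as follows, assuming the classifier yields a signed decision value per document whose magnitude grows with distance from the decision boundary. The function name and the example scores are hypothetical.

```python
def least_certain(scores, k):
    """Return the ids of the k documents whose decision values are closest to zero."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id]))[:k]

# Hypothetical decision values for four unannotated documents.
scores = {"a": 2.3, "b": -0.1, "c": 0.4, "d": -1.7}
to_annotate = least_certain(scores, 2)
```

The selected documents would then be annotated by hand and added to the training set before the next learning run.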
As described above, we estimate that the base 
rate of the headline entailment property in the XIE 
portion of Gigaword is 70%.  Our hypothesis in 
training the SVM was that we could identify a 
smaller but less noisy subset. In order to evaluate 
this, we ran the trained SVM on all 679,000 of the
unannotated XIE documents, and selected the
100,000 “best” instances, that is, the documents
most likely (according to the SVM) to evince the
headline quality.  We selected a random subset of
these best documents, and annotated them to
evaluate our hypothesis.  74% of these possessed
the lead-entails-headline property, a difference of
4% absolute over the XIE base rate. We used the
lead-headline pairs from this 100,000-best subset
to train our MT-alignment-based system for the
RTE evaluation (Bayer et al., 2005).  This system
was one of the best performers in the evaluation,
which we ascribe to our large training corpus.
Later examination showed that the 4% “im-
provement” in purity is not statistically significant. 
We intend to perform further experiments in data 
refinement, but this may prove unnecessary.  Perhaps
the base rate of the entailment phenomenon in
the XIE documents is sufficient to train an
effective alignment-based entailment system.  In this
case, all of the XIE documents could be used, perhaps
resulting in a more robust, and even better
performing system. 
4 Judging Headline Entailments
In the process of generating the training data, we
doubly-judged an additional 300 XIE documents to
measure inter-judge reliability.  As in the pilot
phase described above, each pair was labeled as
yes, no, or maybe. In addition, the judges were
given a comment field to record their reasoning
and misgivings.  The judging was performed in
two steps, first on a set of 100 documents and then
on a set of 200.  One of the judges was already
well versed in the RTE task, and had performed the
earlier pilot investigations.  Prior to judging the
first set, the second judge was given a brief verbal
overview of the task.  After the first 100
documents had been doubly-judged, the more
experienced judge then reviewed the differences and
drafted a set of guidelines.  The guidelines
provided a synopsis of the official RTE guidelines,
plus a few rules unique to headlines. For example,
one rule specified what to do when partial
entailment only held if the lead were combined with
location or date information from the dateline. The
two evaluators then judged the second set.  The
results for both sets are shown in Figure 5.
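The agreement conditions reported in Figure 5 could be computed along the following lines. This is a sketch under our own reading of the conditions: in particular, treating "maybe = *" as "a maybe from either judge matches any label" is an assumption, and the label sequences are invented for illustration.

```python
def _collapse(label, maybe_as):
    """Map 'maybe' to the given label; leave 'yes' and 'no' unchanged."""
    return maybe_as if label == "maybe" else label

def agreement(judge_a, judge_b, condition="strict"):
    """Fraction of pairs on which two judges agree under the given condition."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must label the same pairs")
    hits = 0
    for a, b in zip(judge_a, judge_b):
        if condition == "strict":
            hits += a == b
        elif condition == "maybe=yes":
            hits += _collapse(a, "yes") == _collapse(b, "yes")
        elif condition == "maybe=no":
            hits += _collapse(a, "no") == _collapse(b, "no")
        elif condition == "maybe=*":
            # Assumed reading: a maybe from either judge matches anything.
            hits += a == b or "maybe" in (a, b)
        else:
            raise ValueError("unknown condition: " + condition)
    return hits / len(judge_a)

# Invented labels for five pairs, for illustration only.
judge_a = ["yes", "no", "maybe", "yes", "no"]
judge_b = ["yes", "maybe", "yes", "no", "no"]
```

Comparing the strict figure with the relaxed ones shows how much of the disagreement involves the maybe category rather than outright yes/no conflicts.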
As these results show, the guidelines had only a 
small effect on the strict measure of agreement. 
Three problem areas existed: 
(1) Raw, messy data. The Gigaword corpus was 
automatically collected and zoned. Thus, the
headlines in particular contained a number of
irregularities that made it difficult to judge their
appropriateness. Such irregularities included trun-
cations, phrases lacking any proposition, pre-
pended alerts like URGENT:, and bylines and date 
lines miszoned into the headline. 
(2)  Disagreement on what constitutes synon-
ymy.   Our judges found they had irreconcilable 
differences about differences in meaning.  For example,
in the following pair, the judges disagreed
about whether safe operation in the lead paragraph 
meant the same thing as, and thus entailed,
operates smoothly in the headline:
• Shanghai's Hongqiao Airport Operates Smoothly
• As of Saturday, Shanghai's Hongqiao Airport 
has performed safe operation for some 2,600
consecutive days, setting a record in the country.
(3) Disagreement on the amount of world
knowledge permitted.  Figure 5 shows that if
maybe is counted as equivalent to yes, the
agreement level improves significantly.  This is likely
because there were two important aspects of the
RTE definition of entailment that were not
imparted to the second judge until the written
guidelines:  that one can assume “common human
understanding of language and some common
background knowledge.”  However, our judges did
not always agree on what counts as “common,”
which accounts for much of the high overlap
between yes and maybe.  Nevertheless, our 90%
agreement compares favorably to the 83%
agreement rate reported by Dolan et al. (2005) for their
judgments on “more or less semantically
equivalent” pairs. Our 78% strict agreement compares
favorably to the 80% agreement achieved by
Dagan et al. (2005), given that our data was messier
than the pairs crafted for the RTE challenge.
Like Dagan et al. (2005), we did not force
resolution on all disagreements.  Disagreements over
synonymy and common knowledge result in
irreconcilable differences, because it is neither possible
nor desirable to use guidelines to force a shared
understanding of an utterance.  Thus, for the first
set of data 15 (15%) of the pairs were left unrecon-
ciled.  In the second set, 42 (21%) were left un-
reconciled. Eleven (6%) of the irreconcilable pairs 
in the second set were due to confusion stemming 
from the telegraphic nature of headlines, which led 
to misunderstandings about how to judge truncated 
headlines (Chinese President Vows to Open New 
Chapters With) vs. headlines lacking propositions 
(subject headings like Mandela’s Speech) vs. well-
formed but terse headlines (Crackdown on Auto-
Mafia in Bulgaria).  
Despite the high number of irreconcilable pairs, 
one encouraging sign was evident from the com-
ment field. The judges’ comments revealed that on 
pairs where they disagreed on how to label the 
pair, they often agreed on what the problem was. 
Our experience in generating a training corpus, 
particularly the number of irreconcilable cases we 
encountered, raises an important issue, namely, the
feasibility of semantic equivalence tasks. We sug-
gest that the optimum method for empirically 
modeling semantic equivalence is to capture the 
variation in human judgments.  Three judges 
would evaluate each pair, so that there would al-
ways be a tie breaker.  After reconciling for dis-
agreements arising from human error, each distinct 
judgment would become part of the data set.  We 
also recommend that where there is genuine dis-
agreement, the questionable portions of each pair 
be annotated in some way to capture the source of 
the problem, going one step further than the com-
ment field we found beneficial in our annotation 
interface.  The three judgments would result in a 
four-way classification of pairs:
TTT = TRUE
TTF = Likely TRUE, but possibly FALSE
TFF = Likely FALSE, but possibly TRUE
FFF = FALSE
System developers could choose to train on all
the data, or limit themselves to the TTT/FFF cases.
For evaluation purposes, the systems’ results on
the TTF/TFF pairs could be evaluated in light of
the human variation, providing a more realistic
measure of the complexity of the task.

Condition     Set 1        Set 2
              (100 docs)   (200 docs)
strict match  75.0%        77.5%
maybe = yes   79.0%        90.0%
maybe = no    84.0%        81.0%
maybe = *     88.0%        94.0%
Figure 5: Agreement for Two XIE Data Sets
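The four-way classification proposed in this section follows directly from counting TRUE judgments among the three judges. A minimal sketch, with hypothetical function names, might look like this:

```python
def classify_pair(judgments):
    """Map three Boolean judgments to one of TTT, TTF, TFF, or FFF."""
    if len(judgments) != 3:
        raise ValueError("exactly three judgments are expected")
    trues = sum(bool(j) for j in judgments)
    return "T" * trues + "F" * (3 - trues)

def unanimous_only(labeled_pairs):
    """Keep only the pairs on which all three judges agreed (TTT or FFF)."""
    return [(pair, label) for pair, label in labeled_pairs
            if label in ("TTT", "FFF")]
```

Because the label depends only on the count of TRUE votes, the order in which the judges are listed does not matter.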
5 Conclusion
Given the number of natural language processing 
applications that require the ability to recognize 
semantic equivalence and entailment, there is an 
obvious need for both robust evaluation method-
ologies and adequate development and test data. 
We’ve described here our work in generating sup-
plemental training data for the recent Pascal RTE 
evaluation, with which we produced a competitive 
system. Some news corpora provide a rich source 
of exemplars, and an automatic document classifier 
can be used to reduce the noisiness of the data. 
There are lingering difficulties in achieving high 
inter-judge agreement in determining paraphrase 
and entailment, and we believe the best way to 
cope with this is to allow the data to reflect the 
variance that exists in cross-human judgments.
Acknowledgments 
This paper reports on work supported by the
MITRE Sponsored Research Program.  We would 
also like to extend our thanks to Sam Bayer, John 
Henderson and Alex Yeh for their invaluable sug-
gestions and comments. Our gratitude also goes to 
Laurie Damianos, who provided us with statistics 
on MiTAP’s resources and served as one of the 
evaluators in our inter-judge reliability study. 

References 
Samuel Bayer, John Burger, Lisa Ferro, John
Henderson, and Alexander Yeh, 2005.  MITRE’s
submissions to the EU Pascal RTE Challenge.
PASCAL Proceedings of the First Challenge
Workshop, Recognizing Textual Entailment, 11–13 April,
2005, Southampton, U.K.
John D. Burger, 2004. MITRE’s Qanda at TREC-12.
The Twelfth Text REtrieval Conference.  NIST
Special Publication SP 500–255.
Ido Dagan, Oren Glickman, and Bernardo Magnini,
2005.  The PASCAL recognizing textual entailment
challenge.  PASCAL Proceedings of the First
Challenge Workshop, Recognizing Textual Entailment,
11–13 April, 2005, Southampton, U.K.
Laurie Damianos, Steve Wohlever, Robyn Kozierok,
and Jay Ponte, 2003. MiTAP for real users, real data,
real problems. In Proceedings of the Conference on
Human Factors of Computing Systems (CHI 2003),
Fort Lauderdale, FL, April 5–10.
David Day, John Aberdeen, Lynette Hirschman, Robyn
Kozierok, Patricia Robinson, and Marc Vilain, 1997.
Mixed initiative development of language processing
systems. Proceedings of the Fifth Conference on
Applied Natural Language Processing.
David Day, Chad McHenry, Robyn Kozierok, and Laurel
Riek, 2004.  Callisto: A configurable annotation
workbench. International Conference on Language
Resources and Evaluation.
Bill Dolan, Chris Brockett, and Chris Quirk, 2005.
Microsoft Research Paraphrase Corpus.
http://research.microsoft.com/research/nlp/msr_paraphrase.htm
David Graff, 2003.  English Gigaword.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
Dan Gusfield, 1997. Algorithms on Strings, Trees and
Sequences. Cambridge University Press.
Lynette Hirschman, Marc Light, Eric Breck, and John
D. Burger, 1999. Deep Read: A reading comprehension
system. Proceedings of the 37th annual meeting
of the Association for Computational Linguistics.
Thorsten Joachims, 2002.  Learning to Classify Text
Using Support Vector Machines. Kluwer.
Guido Minnen, John Carroll, and Darren Pearce, 2001.
Applied morphological processing of English.
Natural Language Engineering, 7(3).
Franz Josef Och and Hermann Ney, 2003. A systematic
comparison of various statistical alignment models.
Computational Linguistics, 29(1).
Simon Tong and Daphne Koller, 2000.  Support vector
machine active learning with applications to text
classification.  Proceedings of ICML-00, 17th
International Conference on Machine Learning.
Ben Wellner, Lisa Ferro, Warren Greiff, and Lynette
Hirschman, 2005. Reading comprehension tests for
computer-based understanding evaluation. Natural
Language Engineering (to appear).
