Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 1–6,
Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Classification of semantic relations by humans and machines ∗
Erwin Marsi and Emiel Krahmer
Communication and Cognition
Tilburg University, The Netherlands
{e.c.marsi, e.j.krahmer}@uvt.nl
Abstract
This paper addresses the classification of
semantic relations between pairs of sen-
tences extracted from a Dutch parallel cor-
pus at the word, phrase and sentence level.
We first investigate the performance of hu-
man annotators on the task of manually
aligning dependency analyses of the re-
spective sentences and of assigning one
of five semantic relations to the aligned
phrases (equals, generalizes, specifies, re-
states and intersects). Results indicate that
humans can perform this task well, with
an F-score of .98 on alignment and an F-
score of .95 on semantic relations (after
correction). We then describe and evalu-
ate a combined alignment and classifica-
tion algorithm, which achieves an F-score
on alignment of .85 (using EuroWordNet)
and an F-score of .80 on semantic relation
classification.
1 Introduction
An automatic method that can determine how two
sentences relate to each other in terms of seman-
tic overlap or textual entailment (e.g., (Dagan and
Glickman, 2004)) would be a very useful thing to
have for robust natural language applications. A
summarizer, for instance, could use it to extract
the most informative sentences, while a question-
answering system – to give a second example –
could use it to select potential answer string (Pun-
yakanok et al., 2004), perhaps preferring more spe-
cific answers over more general ones. In general, it
∗This work was carried out within the IMIX-IMOGEN (In-
teractive Multimodal Output Generation) project, sponsored by
the Netherlands Organization of Scientific Research (NWO).
is very useful to know whether some sentence S is
more specific (entails) or more general than (is en-
tailed by) an alternative sentence Sprime, or whether the
two sentences express essentially the same informa-
tion albeit in a different way (paraphrasing).
Research on automatic methods for recognizing
semantic relations between sentences is still rela-
tively new, and many basic issues need to be re-
solved. In this paper we address two such related is-
sues: (1) to what extent can human annotators label
semantic overlap relations between words, phrases
and sentences, and (2) what is the added value of
linguistically informed analyses.
It is generally assumed that pure string overlap
is not sufficient for recognizing semantic relations;
and that using some form of syntactic analysis may
be beneficial (e.g., (Herrera et al., 2005), (Vander-
wende et al., 2005)). Our working hypothesis is that
semantic overlap at the word and phrase levels may
provide a good basis for deciding the semantic re-
lation between sentences. Recognising semantic re-
lations between sentences then becomes a two-step
procedure: first, the words and phrases in the re-
spective sentences need to be aligned, after which
the relations between the pairs of aligned words and
phrases should be labeled in terms of semantic rela-
tions.
Various alignment algorithms have been devel-
oped for data-driven approaches to machine trans-
lation (e.g. (Och and Ney, 2000)). Initially work
focused on word-based alignment, but more and
more work is also addressing alignment at the higher
levels (substrings, syntactic phrases or trees), e.g.,
(Meyers et al., 1996), (Gildea, 2003). For our pur-
poses, an additional advantage of aligning syntac-
tic structures is that it keeps the alignment feasible
(as the number of arbitrary substrings that may be
aligned grows exponentially to the number of words
1
in the sentence). Here, following (Herrera et al.,
2005) and (Barzilay, 2003), we will align sentences
at the level of dependency structures. In addition,
we will label the alignments in terms of five basic
semantic relations to be defined below. We will per-
form this task both manually and automatically, so
that we can address both of the issues raised above.
Section 2 describes a monolingual parallel cor-
pus consisting of two Dutch translations, and for-
malizes the alignment-classification task to be per-
formed. In section 3 we report the results on align-
ment, first describing interannotator agreement on
this task and then the results on automatic alignment.
In section 4, then, we address the semantic relation
classification; again, first describing interannotator
results, followed by results obtained using memory-
based machine learning techniques. We end with a
general discussion.
2 Corpus and Task definition
2.1 Corpus
We have developed a parallel monolingual corpus
consisting of two different Dutch translations of the
French book “Le petit prince” (the little prince) by
Antoine de Saint-Exup´ery (published 1943), one by
Laetitia de Beaufort-van Hamel (1966) and one by
Ernst Altena (2000). For our purposes, this proved
to be a good way to quickly find a large enough
set of related sentence pairs, which differ semanti-
cally in interesting and subtle ways. In this work,
we used the first five chapters, with 290 sentences
and 3600 words in the first translation, and 277 sen-
tences and 3358 words in the second translation.
The texts were automatically tokenized and split into
sentences, after which errors were manually cor-
rected. Corresponding sentences from both trans-
lations were manually aligned; in most cases this
was a one-to-one mapping, but occasionally a sin-
gle sentence in one translation mapped onto two or
more sentences in the other: this occurred 23 times
in all five chapters. Next, the Alpino parser for
Dutch (e.g., (Bouma et al., 2001)) was used for part-
of-speech tagging and lemmatizing all words, and
for assigning a dependency analysis to all sentences.
The POS labels indicate the major word class (e.g.
verb, noun, adj, and adv). The dependency rela-
tions hold between tokens and are identical to those
used in the Spoken Dutch Corpus. These include de-
pendencies such as head/subject, head/modifier and
coordination/conjunction. If a full parse could not
be obtained, Alpino produced partial analyses col-
lected under a single root node. Errors in lemmati-
zation, POS tagging, and syntactic dependency pars-
ing were not subject to manual correction.
2.2 Task definition
The task to be performed can be described infor-
mally as follows: given two dependency analyses,
align those nodes that are semantically related. More
precisely: For each node v in the dependency struc-
ture for a sentence S, we define STR(v) as the sub-
string of all tokens under v (i.e., the composition of
the tokens of all nodes reachable from v). An align-
ment between sentences S and Sprime pairs nodes from
the dependency graphs for both sentences. Aligning
node v from the dependency graph D of sentence
S with node vprime from the graph Dprime of Sprime indicates
that there is a semantic relation between STR(v) and
STR(vprime), that is, between the respective substrings
associated with v and vprime. We distinguish five po-
tential, mutually exclusive, relations between nodes
(with illustrative examples):
1. v equals vprime iff STR(v) and STR(vprime) are literally
identical (abstracting from case). Example: “a
small and a large boa-constrictor” equals “a
large and a small boa-constrictor”;
2. v restates vprime iff STR(v) is a paraphrase of
STR(vprime) (same information content but differ-
ent wording). Example: “a drawing of a boa-
constrictor snake” restates “a drawing of a boa-
constrictor”;
3. v specifies vprime iff STR(v) is more specific than
STR(vprime). Example: “the planet B 612” specifies
“the planet”;
4. v generalizes vprime iff STR(vprime) is more specific
than STR(v). Example: “the planet” general-
izes “the planet B 612”;
5. v intersects vprime iff STR(v) and STR(vprime) share
some informational content, but also each ex-
press some piece of information not expressed
in the other. Example: “Jupiter and Mars” in-
tersects “Mars and Venus”
Figure 1 shows an example alignment with seman-
tic relations between the dependency structures of
2
hebben
komen
hebben
ik
ik op in in aanraking met
zo contact met in de loop van
veel
heel
persoon
serieus veel
massa
gewichtig
heel
leven
mijn
leven
het
manier
die mens
Figure 1: Dependency structures and alignment for the sentences Zo heb ik in de loop van mijn leven heel
veel contacten gehad met heel veel serieuze personen. (lit. ‘Thus have I in the course of my life very
many contacts had with very many serious persons’) and Op die manier kwam ik in het leven met massa’s
gewichtige mensen in aanraking.. (lit. ‘In that way came I in the life with mass-of weighty/important people
in touch’). The alignment relations are equals (dotted gray), restates (solid gray), specifies (dotted black),
and intersects (dashed gray). For the sake of transparency, dependency relations have been omitted.
two sentences. Note that there is an intuitive rela-
tion with entailment here: both equals and restates
can be understood as mutual entailment (i.e., if the
root nodes of the analyses corresponding S and Sprime
stand in an equal or restate relation, S entails Sprime and
Sprime entails S), if S specifies Sprime then S also entails Sprime
and if S generalizes Sprime then S is entailed by Sprime.
In remainder of this paper, we will distinguish two
aspects of this task: alignment is the subtask of pair-
ing related nodes – or more precise, pairing the to-
ken strings corresponding to these nodes; classifica-
tion of semantic relations is the subtask of labeling
these alignments in terms of the five types of seman-
tic relations.
2.3 Annotation procedure
For creating manual alignments, we developed a
special-purpose annotation tool which shows, side
by side, two sentences, as well as their respective
dependency graphs. When the user clicks on a node
v in the graph, the corresponding string (STR(v)) is
shown at the bottom. The tool enables the user to
manually construct an alignment graph on the basis
of the respective dependency graphs. This is done by
focusing on a node in the structure for one sentence,
and then selecting a corresponding node (if possible)
in the other structure, after which the user can select
the relevant alignment relation. The tool offers addi-
tional support for folding parts of the graphs, high-
lighting unaligned nodes and hiding dependency re-
lation labels.
All text material was aligned by the two authors.
They started with annotating the first ten sentences
of chapter one together in order to get a feel for
the task. They continued with the remaining sen-
tences from chapter one individually (35 sentences
and 521 in the first translation, and 35 sentences and
481 words in the second translation). Next, both
annotators discussed annotation differences, which
triggered some revisions in their respective annota-
tion. They also agreed on a single consensus annota-
tion. Interannotator agreement will be discussed in
the next two sections. Finally, each author annotated
two additional chapters, bringing the total to five.
3 Alignment
3.1 Interannotator agreement
Interannotator agreement was calculated in terms of
precision, recall and F-score (with β = 1) on aligned
3
(A1,A2) (A1prime,A2prime) (Ac,A1prime) (Ac,A2prime)
#real: 322 323 322 322
#pred: 312 321 323 321
#correct: 293 315 317 318
precision: .94 .98 .98 .99
recall: .91 .98 .98 .99
F-score: .92 .98 .98 .99
Table 1: Interannotator agreement with respect
to alignment between annotators 1 and 2 before
(A1,A2) and after (A1prime,A2prime) revision , and between
the consensus and annotator 1 (Ac,A1prime) and annota-
tor 2 (Ac,A2prime) respectively.
node pairs as follows:
precision = | Areal ∩ Apred | / | Apred | (1)
recall = | Areal ∩ Apred | / | Areal | (2)
F-score = (2 × prec × rec) / (prec + rec) (3)
where Areal is the set of all real alignments (the ref-
erence or golden standard), Apred is the set of all
predicted alignments, and Apred∩Areal is the set all
correctly predicted alignments. For the purpose of
calculating interannotator agreement, one of the an-
notations (A1) was considered the ‘real’ alignment,
the other (A2) the ‘predicted’. The results are sum-
marized in Table 1 in column (A1,A2).1
As explained in section 2.3, both annotators re-
vised their initial annotations. This improved their
agreement, as shown in column (A1prime,A2prime). In ad-
dition, they agreed on a single consensus annotation
(Ac). The last two columns of Table 1 show the re-
sults of evaluating each of the revised annotations
against this consensus annotation. The F-score of
.98 can therefore be regarded as the upper bound on
the alignment task.
3.2 Automatic alignment
Our tree alignment algorithm is based on the dy-
namic programming algorithm in (Meyers et al.,
1996), and similar to that used in (Barzilay, 2003).
It calculates the match between each node in de-
pendency tree D against each node in dependency
tree Dprime. The score for each pair of nodes only de-
pends on the similarity of the words associated with
the nodes and, recursively, on the scores of the best
1Note that since there are no classes, we can not calculate
change agreement rethe Kappa statistic.
matching pairs of their descendants. The node simi-
larity function relies either on identity of the lemmas
or on synonym, hyperonym, and hyponym relations
between them, as retrieved from EuroWordNet.
Automatic alignment was evaluated with the con-
sensus alignment of the first chapter as the gold
standard. A baseline was constructed by aligning
those nodes which stand in an equals relation to each
other, i.e., a node v in D is aligned to a node vprime
in Dprime iff STR(v) =STR(vprime). This baseline already
achieves a relatively high score (an F-score of .56),
which may be attributed to the nature of our mate-
rial: the translated sentence pairs are relatively close
to each other and may show a sizeable amount of lit-
eral string overlap. In order to test the contribution
of synonym and hyperonym information for node
matching, performance is measured with and with-
out the use of EuroWordNet. The results for auto-
matic alignment are shown in Table 2. In compari-
son with the baseline, the alignment algorithm with-
out use of EuroWordnet loses a few points on preci-
sion, but improves a lot on recall (a 200% increase),
which in turn leads to a substantial improvement on
the overall F-score. The use of EurWordNet leads to
a small increase (two points) on both precision and
recall, and thus to small increase in F-score. How-
ever, in comparison with the gold standard human
score for this task (.95), there is clearly room for
further improvement.
4 Classification of semantic relations
4.1 Interannotator agreement
In addition to alignment, the annotation procedure
for the first chapter of The little prince by two anno-
tators (cf. section 2.3) also involved labeling of the
semantic relation between aligned nodes. Interanno-
tator agreement on this task is shown Table 3, before
and after revision. The measures are weighted preci-
sion, recall and F-score. For instance, the precision
is the weighted sum of the separate precision scores
for each of the five relations. The table also shows
the κ-score. The F-score of .97 can be regarded as
the upper bound on the relation labeling task. We
think these numbers indicate that the classification
of semantic relations is a well defined task which
can be accomplished with a high level of interanno-
tator agreement.
4
Alignment : Prec : Rec : F-score:
baseline .87 .41 .56
algorithm without wordnet .84 .82 .83
algorithm with wordnet .86 .84 .85
Table 2: Precision, recall and F-score on automatic
alignment
(A1,A2) (A1prime,A2prime) (Ac,A1prime) (Ac,A2prime)
precision: .86 .96 .98 .97
recall: .86 .95 .97 .97
F-score: .85 .95 .97 .97
κ: .77 .92 .96 .96
Table 3: Interannotator agreement with respect to se-
mantic relation labeling between annotators 1 and 2
before (A1,A2) and after (A1prime,A2prime) revision , and
between the consensus and annotator 1 (Ac,A1prime)
and annotator 2 (Ac,A2prime) respectively.
4.2 Automatic classification
For the purpose of automatic semantic relation la-
beling, we approach the task as a classification prob-
lem to be solved by machine learning. Alignments
between node pairs are classified on the basis of the
lexical-semantic relation between the nodes, their
corresponding strings, and – recursively – on previ-
ous decisions about the semantic relations of daugh-
ter nodes. The input features used are:
• a boolean feature representing string identity
between the strings corresponding to the nodes
• a boolean feature for each of the five semantic
relations indicating whether the relation holds
for at least one of the daughter nodes;
• a boolean feature indicating whether at least
one of the daughter nodes is not aligned;
• a categorical feature representing the lexical se-
mantic relation between the nodes (i.e. the
lemmas and their part-of-speech) as found in
EuroWordNet, which can be synonym, hyper-
onym, or hyponym.2
To allow for the use of previous decisions, the
nodes of the dependency analyses are traversed in
a bottom-up fashion. Whenever a node is aligned,
the classifier assigns a semantic label to the align-
ment. Taking previous decisions into account may
2These three form the bulk of all relations in Dutch Eu-
roWordnet. Since no word sense disambiguation was involved,
we simply used all word senses.
Prec : Rec : F-score:
equals .93±.06 .95±.04 .94±.02
restates .56±.08 .78±.04 .65±.05
specifies n.a. 0 n.a.
generalizes .19±.06 .37±.09 .24±.05
intersects n.a. 0 n.a.
Combined: .62±.01 .70±.02 .64±.02
Table 4: Average precision, recall and F-score (and
SD) over all 5 folds on automatic classification of
semantic relations
cause a proliferation of errors: wrong classification
of daughter nodes may in turn cause wrong classifi-
cation of the mother node. To investigate this risk,
classification experiments were run both with and
without (i.e. using the annotation) previous deci-
sions.
Since our amount of data is limited, we used
a memory-based classifier, which – in contrast to
most other machine learning algorithms – performs
no abstraction, allowing it to deal with productive
but low-frequency exceptions typically occurring in
NLP tasks(Daelemans et al., 1999). All memory-
based learning was performed with TiMBL, version
5.1 (Daelemans et al., 2004), with its default set-
tings (overlap distance function, gain-ratio feature
weighting, k = 1).
The five first chapters of The little prince were
used to run a 5-fold cross-validated classification ex-
periment. The first chapter is the consensus align-
ment and relation labeling, while the other four were
done by one out of two annotators. The alignments
to be classified are those from to the human align-
ment. The baseline of always guessing equals – the
majority class – gives a precision of 0.26, a recall of
0.51, and an F-score of 0.36. Table 4 presents the re-
sults broken down to relation type. The combined F-
score of 0.64 is almost twice the baseline score. As
expected, the highest score goes to equals, followed
by a reasonable score on restates. Performance on
the other relation types is rather poor, with even no
predictions of specifies and intersects at all.
Faking perfect previous decisions by using the
annotation gives a considerable improvement, as
shown in Table 5, especially on specifies, general-
izes and intersects. This reveals that the prolifera-
tion of classification errors is indeed a problem that
should be addressed.
5
Prec : Rec : F-score:
equals .99±.02 .97±.02 .98±.01
restates .65±.04 .82±.04 .73±.03
specifies .60±.12 .48±.10 .53±.09
generalizes .50±.11 .52±.10 .50±.09
intersects .69±.27 .35±.12 .46±.16
Combined: .82±.02 .81±.02 .80±.02
Table 5: Average precision, recall and F-score (and
SD) over all 5 folds on automatic classification of
semantic relations without using previous decisions.
In sum, these results show that automatic classifi-
cation of semantic relations is feasible and promis-
ing – especially when the proliferation of classifica-
tion errors can be prevented – but still not nearly as
good as human performance.
5 Discussion and Future work
This paper presented an approach to detecting se-
mantic relations at the word, phrase and sentence
level on the basis of dependency analyses. We inves-
tigated the performance of human annotators on the
tasks of manually aligning dependency analyses and
of labeling the semantic relations between aligned
nodes. Results indicate that humans can perform this
task well, with an F-score of .98 on alignment and an
F-score of .92 on semantic relations (after revision).
We also described and evaluated automatic methods
addressing these tasks: a dynamic programming tree
alignment algorithm which achieved an F-score on
alignment of .85 (using lexical semantic information
from EuroWordNet), and a memory-based seman-
tic relation classifier which achieved F-scores of .64
and .80 with and without using real previous deci-
sions respectively.
One of the issues that remains to be addressed
in future work is the effect of parsing errors. Such
errors were not corrected, but during manual align-
ment, we sometimes found that substrings could not
be properly aligned because the parser had failed to
identify them as syntactic constituents. As far as
classification of semantic relations is concerned, the
proliferation of classification errors is an issue that
needs to be solved. Classification performance may
be further improved with additional features (e.g.
phrase length information), optimization, and more
data. Also, we have not yet tried to combine au-
tomatic alignment and classification. Yet another
point concerns the type of text material. The sen-
tence pairs from our current corpus are relatively
close, in the sense that both translations more or less
convey the same information. Although this seems a
good starting point to study alignment, we intend to
continue with other types of text material in future
work. For instance, in extending our work to the ac-
tual output of a QA system, we expect to encounter
sentences with far less overlap.

References
R. Barzilay. 2003. Information Fusion for Multidocu-
ment Summarization. Ph.D. Thesis, Columbia Univer-
sity.
G. Bouma, G. van Noord, and R. Malouf. 2001. Alpino:
Wide-coverage computational analysis of Dutch. In
Computational Linguistics in The Netherlands 2000,
pages 45–59.
W. Daelemans, A. Van den Bosch, and J. Zavrel. 1999.
Forgetting exceptions is harmful in language learning.
Machine Learning, Special issue on Natural Language
Learning, 34:11–41.
W. Daelemans, J. Zavrel, K. Van der Sloot, and
A. van den Bosch. 2004. TiMBL: Tilburg memory
based learner, version 5.1, reference guide. ILK Tech-
nical Report 04-02, Tilburg University.
I. Dagan and O. Glickman. 2004. Probabilistic textual
entailment: Generic applied modelling of language
variability. In Learning Methods for Text Understand-
ing and Mining, Grenoble.
D. Gildea. 2003. Loosely tree-based alignment for ma-
chine translation. In Proceedings of the 41st Annual
Meeting of the ACL, Sapporo, Japan.
J. Herrera, A. Pe nas, and F. Verdejo. 2005. Textual
entailment recognition based on dependency analy-
sis and wordnet. In Proceedings of the 1st. PASCAL
Recognision Textual Entailment Challenge Workshop.
Pattern Analysis, Statistical Modelling and Computa-
tional Learning, PASCAL.
A. Meyers, R. Yangarber, and R. Grisham. 1996. Align-
ment of shared forests for bilingual corpora. In Pro-
ceedings of 16th International Conference on Com-
putational Linguistics (COLING-96), pages 460–465,
Copenhagen, Denmark.
F.J. Och and H. Ney. 2000. Statistical machine trans-
lation. In EAMT Workshop, pages 39–46, Ljubljana,
Slovenia.
V. Punyakanok, D. Roth, and W. Yih. 2004. Natural lan-
guage inference via dependency tree mapping: An ap-
plication to question answering. Computational Lin-
guistics, 6(9).
L. Vanderwende, D. Coughlin, and W. Dolan. 2005.
What syntax can contribute in entailment task. In Pro-
ceedings of the 1st. PASCAL Recognision Textual En-
tailment Challenge Workshop, Southampton, U.K.
