Semantic Coherence Scoring Using an Ontology
Iryna Gurevych Rainer Malaka Robert Porzel Hans-Peter Zorn
European Media Lab GmbH
Schloss-Wolfsbrunnenweg 31c
D-69118 Heidelberg, Germany
{gurevych,malaka,porzel,zorn}@eml.org
Abstract
In this paper we present ONTOSCORE, a sys-
tem for scoring sets of concepts on the basis
of an ontology. We apply our system to the
task of scoring alternative speech recognition
hypotheses (SRH) in terms of their semantic
coherence. We conducted an annotation exper-
iment and showed that human annotators can
reliably differentiate between semantically co-
herent and incoherent speech recognition hy-
potheses. An evaluation of our system against
the annotated data shows that it successfully
classifies 73.2% of a German corpus of 2,284
SRHs as either coherent or incoherent (given a
baseline of 54.55%).
1 Introduction
Following Allen et al. (2001), we can distinguish be-
tween controlled and conversational dialogue systems.
Since controlled and restricted interactions between the
user and the system increase recognition and understand-
ing accuracy, such systems are reliable enough to be de-
ployed in various real world applications, e.g. public
transportation or cinema information systems. The more
conversational a dialogue system becomes, the less pre-
dictable are the users’ utterances. Recognition and pro-
cessing become increasingly difficult and unreliable.
Today’s dialogue systems employ domain- and
discourse-specific knowledge bases, so-called ontologies,
to represent the individual discourse entities as concepts,
and their relations to each other. In this paper we present
an algorithm for measuring the semantic coherence of
sets of concepts against such an ontology. In the fol-
lowing, we will show how the semantic coherence mea-
surement can be applied to estimate how well a given
speech recognition hypothesis (SRH) fits with respect to
the existing knowledge representation, thereby providing
a mechanism that increases the robustness and reliability
of dialogue systems.
In Section 2 we discuss the problem of scoring and
classifying SRHs in terms of their semantic coherence
followed by a description of our annotation experiment.
Section 3 contains a description of the kind of knowledge
representations employed by ONTOSCORE. We present
the algorithm in Section 4, and an evaluation of the cor-
responding system for scoring SRHs is given in Section
5. A conclusion and additional applications are given in
Section 6.
2 Semantic Coherence and Speech
Recognition Hypotheses
2.1 The Problem
While a simple one-best hypothesis interface between au-
tomatic speech recognition (ASR) and natural language
understanding (NLU) suffices for restricted dialogue sys-
tems, more complex systems either operate on n-best lists
as ASR output or convert ASR word graphs (Oerder
and Ney, 1993) into n-best lists, given the distribution of
acoustic and language model scores (Schwartz and Chow,
1990; Tran et al., 1996). For example, in our data a user
expressed the wish to see a specific city map again, as:1
(1) Ich würde die Karte gerne wiedersehen
    I would the map like to see again
Looking at two SRHs from the ensuing n-best list we
found that Example (1a) constituted a suitable represen-
tation of the utterance, whereas Example (1b) constituted
a less adequate representation thereof.
(1a) Ich würde die Karte eine wieder sehen
     I would the map one again see

(1b) Ich würde die Karte eine Wiedersehen
     I would the map one Good Bye
Facing multiple representations of a single utterance consequently poses the question of which of the different hypotheses most likely corresponds to the user's utterance.

1All examples are displayed with the German original on top and a glossed translation below.

Several ways of solving this problem have been
proposed and implemented in various systems. Fre-
quently the scores provided by the ASR system itself
are used, e.g. acoustic and language model probabilities.
More recently also scores provided by the NLU system
have been employed, e.g. parsing scores or discourse
scores (Litman et al., 1999; Engel, 2002; Alexanders-
son and Becker, 2003). However, these methods sometimes assign higher scores to SRHs which are semantically incoherent and lower scores to semantically coherent ones, and they can disagree with each other.
For instance, the acoustic and language model scores
of Example (1b) are actually better than for Example (1a),
which results from the fact that the frequencies and cor-
responding probabilities for important expressions, such
as Good Bye, are rather high, thereby ensuring their reli-
able recognition. Another phenomenon found in our data
consists of hypotheses such as:
(2) Zeige mir alle Vergnügen
    Show me all pleasures

(3) Zeige mir alle Filmen
    Show me all films
In these cases language model scores are higher for Example (2) than for Example (3), as the incorrect inflection in alle Filmen was less frequent in the training material than the correct inflection in alle Vergnügen.
Our data also shows - as one would intuitively ex-
pect - that the understanding-based scores generally re-
flect how well a given SRH is covered by the grammar
employed. In many less well-formed cases these scores
do not correspond to the correctness of the SRH. Gener-
ally we find instances where all existing scoring methods
disagree with each other, diverge from the actual word er-
ror rate and ignore the semantic coherence.2 None of
the aforementioned approaches systematically employs
the system’s knowledge of the domains at hand. This in-
creases the number of times where a suboptimal recogni-
tion hypothesis is passed through the system. This means
that, while there was a better representation of the actual
utterance in the n-best list, the NLU system is processing
an inferior one, thereby causing overall dialogue metrics,
in the sense of Walker et al. (2000), to decrease. We pro-
pose an alternative way to rank SRHs on the basis of their
semantic coherence with respect to a given ontology rep-
resenting the domains of the system.
2.2 Annotation Experiments
In a previous study (Gurevych et al., 2002), we tested if
human annotators could reliably classify SRHs in terms
of their semantic coherence. The task of the annotators was to determine whether a given hypothesis represents an internally coherent utterance or not.

2As the numbers reported for large-vocabulary speech recognition performance show (Cox et al., 2000), the occurrence of less well-formed and incoherent SRHs increases the more conversational a system becomes.
In order to test the reliability of such annotations, we
collected a corpus of SRHs. The data collection was con-
ducted by means of a hidden operator test. We had 29
subjects prompted to say certain inputs in 8 dialogues.
1,479 turns were recorded. Each user-turn in the dialogue
corresponded to a single intention, e.g. a route request or
a sight information request. The audio files were then sent to the speech recognizer, and its output, i.e. the n-best lists of SRHs forming the input to the semantic coherence scoring module, was recorded in log-files. The final corpus consisted of 2,284 SRHs. All hypotheses were then randomly mixed
to avoid contextual influences and given to separate an-
notators. The resulting Kappa statistic (Carletta, 1996)
over the annotated data yields κ = 0.7, which seems to
indicate that human annotators can reliably distinguish
between coherent samples (as in Example (1a)) and inco-
herent ones (as in Example (1b)).
The aim of the work presented here, then, was to provide a knowledge-based score that can be employed by any NLU system to select the best hypothesis from a given n-best list. ONTOSCORE, the resulting system, will be described below, followed by its evaluation against the human gold standard.
3 The Knowledge Base
In this section, we provide a description of the pre-
existing knowledge source employed by ONTOSCORE,
as far as it is necessary to understand the empirical data
generated by the system. It is important to note that the
ontology employed in this evaluation existed already and
was crafted as a general knowledge representation for
various processing modules within the system.3
Ontologies have traditionally been used to represent
general and domain specific knowledge and are employed
for various natural language understanding tasks, e.g. se-
mantic interpretation (Allen, 1987). We propose an addi-
tional way of employing ontologies, i.e. to use the knowl-
edge modeled therein as the basis for evaluating the se-
mantic coherence of sets of concepts.
The system described herein can be employed indepen-
dently of the specific ontology language used, as the un-
derlying algorithm operates only on the nodes and named
edges of the directed graph represented by the ontology.
The specific knowledge base, e.g. written in DAML+OIL
or OWL,4 is converted into a graph, consisting of:
• the class hierarchy, with each class corresponding to a concept representing either an entity or a process;
• the slots, i.e. the named edges of the graph corresponding to the class properties, constraints and restrictions.

3Alternative knowledge representations, such as WORDNET, could have been employed in theory as well; however, most of the modern domains of the system, e.g. electronic media or program guides, are not covered by WORDNET.

4DAML+OIL and OWL are frequently used knowledge modeling languages originating in W3C and Semantic Web projects. For more detail, see www.w3c.org.
The ontology employed herein has about 730 concepts
and 200 relations. It includes a generic top-level ontol-
ogy whose purpose is to provide a basic structure of the
world, i.e. abstract classes that divide the universe into distinct parts resulting from the ontological analysis. The
top-level was developed following the procedure outlined
in Russell and Norvig (1995).
In the ontology employed herein, Role is the most general class and represents a role that any entity or process can perform. It is di-
vided into Event and Abstract Event. Event is
used to describe a kind of role any entity or process may
have in a "real" situation or process, e.g. a building or
an information search. It is contrasted with Abstract
Event, which is abstracted from a set of situations and
processes. It reflects no reality and is used for the gen-
eral categorization and description, e.g. Number, Set,
Spatial Relation. There are two kinds of events:
Physical Object and Process.
The class Physical Object describes any kind of
objects we come in contact with - living as well as non-
living - having a location in space and time in contrast to
abstract objects. These objects refer to different domains,
such as Sight and Route in the tourism domain, Av
Medium and Actor in the TV and cinema domain, etc.,
and can be associated with certain relations in the pro-
cesses via slot constraint definitions.
The modeling of Process as a kind of event
that is continuous and homogeneous in nature, follows
the frame semantic analysis used for generating the
FRAMENET data (Baker et al., 1998). Currently, there
are four groups of processes (see Figure 1):
• General Process, a set of the most general processes such as duplication, imitation or repetition processes;
• Mental Process, a set of processes such as cognitive, emotional or perceptual processes;
• Physical Process, a set of processes such as motion, transaction or controlling processes;
• Social Process, a set of processes such as communication or instruction processes.
Let us consider the definition of the Information
Search Process in the ontology. It is modeled as a
subclass of the Cognitive Process, which is a sub-
class of the Mental Process and inherits the follow-
ing slot constraints:
• begin time, a time expression indicating the starting time point;
• end time, a time expression indicating the time point when the process is complete;
• state, one of the abstract process states, e.g. start, continue, interrupt, etc.;
• cognizer, filled with the class Person including its subclasses.
Information Search Process features one ad-
ditional slot constraint, piece-of-information. The possi-
ble slot-fillers are a range of domain objects, e.g. Sight,
Performance, or whole sets of those, e.g. Tv
Program, but also processes, e.g. Controlling Tv
Device Process. This way, an utterance such as:
(4) Ich hätte gerne Informationen zum Schloss
    I would like information about the castle
can also be mapped onto Information Search
Process, which has an agent of type User and a piece
of information of type Sight. Sight has a name of
type Castle. Analogously, the utterance:
(5) Wie kann ich den Fernseher steuern
    How can I the TV control
can be mapped onto Information Search
Process, which has an agent of type User and
has a piece of information of type Controlling Tv
Device Process.
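To make the graph structure concrete, the following minimal Python sketch shows the kind of node/edge representation this conversion produces; the class and slot names are illustrative examples, and the actual DAML+OIL/OWL parsing is omitted:

```python
# Illustrative sketch: an ontology reduced to nodes and named, directed
# edges. The triples below are invented examples in the spirit of Section 3;
# actual DAML+OIL/OWL parsing is not shown.
from collections import defaultdict

TRIPLES = [
    # class hierarchy (subclass-of edges)
    ("Information Search Process", "subclass-of", "Cognitive Process"),
    ("Cognitive Process", "subclass-of", "Mental Process"),
    # slot constraints (named semantic relations)
    ("Information Search Process", "cognizer", "Person"),
    ("Information Search Process", "piece-of-information", "Sight"),
]

graph = defaultdict(list)  # node -> [(relation label, neighbour), ...]
for subject, relation, obj in TRIPLES:
    graph[subject].append((relation, obj))

print(graph["Information Search Process"])
```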
4 Ontology-based Scoring of SRHs
ONTOSCORE performs a number of processing steps, each of which will be described separately in the respective subsections.
4.1 Mapping of SRH to Sets of Concepts
A necessary preprocessing step is to convert each SRH
into a concept representation (CR). For that purpose we
augmented the system’s lexicon with specific concept
mappings. That is, for each entry in the lexicon either zero, one or many corresponding concepts were added.
A simple vector of the concepts, corresponding to the
words in the SRH for which concepts in the lexicon exist,
constitutes the resulting CR. All other words with empty
concept mappings, e.g. articles, are ignored in the con-
version.
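A minimal sketch of this conversion step in Python, assuming a toy lexicon (the entries below are illustrative, not the system's actual 3,600-word lexicon):

```python
# Sketch of the SRH-to-CR conversion; the toy lexicon is illustrative.
# Ambiguous entries (several concepts per word) are handled in Section 4.7;
# here the first candidate is taken.
LEXICON = {
    "ich": ["Person"],
    "karte": ["Map"],
    "sehen": ["Watch Perceptual Process"],
    "wiedersehen": ["Watch Perceptual Process", "Parting Process"],
    "die": [],  # function words have empty mappings and are ignored
}

def srh_to_cr(srh):
    cr = []
    for word in srh.lower().split():
        concepts = LEXICON.get(word, [])
        if concepts:  # words without concept mappings drop out
            cr.append(concepts[0])
    return cr

print(srh_to_cr("Ich würde die Karte eine wieder sehen"))
# -> ['Person', 'Map', 'Watch Perceptual Process']
```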
Figure 1: Upper part of the process hierarchy.
4.2 Mapping of CR to Graphs
ONTOSCORE converts the domain model, i.e. an ontol-
ogy, into a directed graph with concepts as nodes and re-
lations as edges. One additional problem that needed to
be solved lies in the fact that the directed subclass-of rela-
tions enable path algorithms to ascend the class hierarchy
upwards, but do not let them descend, therefore missing
a significant set of possible paths. In order to remedy that
situation the graph was enriched during its conversion by
corresponding parent-of relations, which eliminates the directionality problem as well as avoiding cycles and 0-paths. In order to find the shortest path between two con-
cepts, ONTOSCORE employs the single source shortest
path algorithm of Dijkstra (Cormen et al., 1990).
Given a concept representation CR = {c_1, ..., c_n}, the algorithm runs once for each concept. The Dijkstra algorithm calculates minimal paths from a source node to all other nodes. Then, the minimal paths connecting a given concept c_i with every other concept in CR (excluding c_i itself) are selected, resulting in an n × n matrix of the respective paths.
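A sketch of this step in Python, using a standard heap-based Dijkstra over a toy relation graph (the graph content is invented; the edge weights anticipate the isa = 0, other relations = 1 weighting of Section 4.3):

```python
# Sketch: Dijkstra from every concept in CR over a toy relation graph,
# yielding the n x n distance matrix described above. Graph content is
# invented; weight 0 would be used for isa/parent-of edges (Section 4.3).
import heapq

GRAPH = {  # node -> [(neighbour, edge weight), ...]
    "Watch Perceptual Process": [("Person", 1), ("Map", 1)],
    "Person": [],
    "Map": [],
}

def dijkstra(source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in GRAPH.get(node, []):
            if d + weight < dist.get(neighbour, float("inf")):
                dist[neighbour] = d + weight
                heapq.heappush(heap, (d + weight, neighbour))
    return dist

CR = ["Person", "Map", "Watch Perceptual Process"]
matrix = {c: dijkstra(c) for c in CR}  # one Dijkstra run per concept
print(matrix["Watch Perceptual Process"])
```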
4.3 The Scoring Algorithm
To score the minimal paths connecting all concepts with
each other in a given CR, we first adopted a method pro-
posed by Demetriou and Atwell (1994) to score the se-
mantic coherence of alternative sentence interpretations
against graphs based on the Longman Dictionary of Con-
temporary English (LDOCE). To construct the graph, the dictionary lemmata were represented as nodes in an isa hierarchy, and their semantic relations, extracted automatically from the LDOCE, were represented as edges.
As defined by Demetriou and Atwell (1994), R = {r_1, r_2, ..., r_n} is the set of direct relations (both isa and semantic relations) that can connect two nodes (concepts); and W = {w_1, w_2, ..., w_n} is the set of corresponding weights, where the weight of each isa relation is set to 0 and that of each other relation to 1. For each two concepts c_i, c_j, the set S = {s_1, s_2, ..., s_m} denotes the scores of all possible paths that link the two concepts. The score of path k (k = 1, ..., m) is given as:

\[ s_k = \sum_{i=1}^{n} l_i w_i \]

where l_i represents the number of times the relation r_i occurs in path k. The ensuing distance between two concepts c_i and c_j is, then, defined as the minimum score derived between c_i and c_j, i.e.:

\[ D(c_i, c_j) = \min_k(s_k), \quad k = 1, 2, \ldots, m \]
The algorithm selects from the set of all paths between
two concepts the one with the smallest weight, i.e. the
cheapest. The distances between all concept pairs in CR
are summed up to a total score. The set of concepts
with the lowest aggregate score represents the combina-
tion with the highest semantic relatedness.
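In code, the path score and resulting concept distance might look as follows (a sketch; the example paths and relation names are invented):

```python
# Sketch of s_k and D(c_i, c_j): a path is a list of relation labels;
# isa-type edges weigh 0, every other relation weighs 1, and the distance
# is the cheapest path score.
WEIGHTS = {"isa": 0, "parent-of": 0}  # all other relations default to 1

def path_score(path):
    """s_k = sum of the weights of the relations occurring in path k."""
    return sum(WEIGHTS.get(relation, 1) for relation in path)

def concept_distance(paths):
    """D(c_i, c_j) = minimum score over all paths linking c_i and c_j."""
    return min(path_score(p) for p in paths)

print(concept_distance([["isa", "has-watcher"],       # score 1
                        ["has-part", "has-agent"]]))  # score 2 -> D = 1
```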
Demetriou and Atwell (1994) do not provide concrete evaluation results for the method. Also, their algorithm only allows for a relative judgment as to which of a set of interpretations of a single sentence is the most semantically related.
Since our objective is to compute semantic coherence
scores of arbitrary CRs on an absolute scale, certain ex-
tensions are necessary. In this application, the CRs to
be scored can differ in terms of their content, the num-
ber of concepts contained therein and their mappings to
the original SRH. Moreover, in order to achieve absolute
values, the final score should be related to the number of
concepts in an individual set and the number of words in
the original SRH. Therefore, the results must be normal-
ized in order to allow for evaluation, comparability and
clearer interpretation of the semantic coherence scores.
4.4 Scoring Concept Representations
We modified the algorithm described above to make it
applicable and evaluatable with respect to the task at
hand as well as other possible tasks. The basic idea is
to calculate a score based on the path distances in CR. Since short distances indicate coherence and many concept pairs in a given CR may have no connecting path, we define the distance between two concepts c_i and c_j that are only connected via isa relations in the knowledge base as Dmax. This maximum value can also serve as a maximum for long distances and can thus help to prune the search tree for long paths. This constant has to be set according to the structure of the knowledge base. For example, employing the ontology described above, the maximum distance between two concepts does not exceed ten, and we chose in that case Dmax = 10.
We can now define the semantic coherence score for CR as the average path length between all concept pairs in CR:

\[ S(CR) = \frac{\sum_{c_i, c_j \in CR,\; c_i \neq c_j} D(c_i, c_j)}{|CR|^2 - |CR|} \]
Since the ontology is a directed graph, we have |CR|^2 - |CR| pairs of concepts with possible directed connections, i.e., a path from concept c_i to concept c_j may be completely different from that from c_j to c_i or may even be missing. As a symmetric alternative, we may want to consider a path from c_i to c_j and a path from c_j to c_i to be semantically equivalent and thus model every relation in a bidirectional way. We can then compute a symmetric score S_s(CR) as:

\[ S_s(CR) = \frac{2 \sum_{c_i, c_j \in CR,\; i < j} \min\bigl(D(c_i, c_j), D(c_j, c_i)\bigr)}{|CR|^2 - |CR|} \]
ONTOSCORE implements both options. In the ontology currently employed by the system some reverse relations can be found, e.g. given c_1 = Broadcast and c_2 = Channel, there exists a path from c_1 to c_2 via the relation has-channel and a different path from c_2 to c_1 via the relation has-broadcast. However, such reverse relations are only sporadically represented in the ontology. Consequently, it is difficult to account for their influence on S(CR) in general. That is why we chose the S_s(CR) function for the evaluation, i.e. only the best path D(c_i, c_j) between a given pair of concepts, regardless of the direction, is taken into account.
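Both scores reduce to a few lines of code. The following sketch hard-codes the distance table of the example in Section 4.6 (all unlisted directed pairs default to Dmax) and reproduces its values:

```python
# Sketch of S(CR) and S_s(CR); the distance table mirrors the CR_1 example
# of Section 4.6, with Dmax = 10 for all unlisted directed pairs.
from itertools import combinations, permutations

DMAX = 10
D = {("Watch Perceptual Process", "Person"): 1,  # via has-watcher
     ("Watch Perceptual Process", "Map"): 1}     # via has-watchable object

def d(ci, cj):
    return D.get((ci, cj), DMAX)

def score(cr):
    """S(CR): average distance over all |CR|^2 - |CR| ordered pairs."""
    pairs = list(permutations(cr, 2))
    return sum(d(ci, cj) for ci, cj in pairs) / len(pairs)

def score_symmetric(cr):
    """S_s(CR): direction-insensitive variant taking the cheaper direction."""
    pairs = list(combinations(cr, 2))
    return sum(min(d(ci, cj), d(cj, ci)) for ci, cj in pairs) / len(pairs)

CR_1 = ["Person", "Map", "Watch Perceptual Process"]
print(score(CR_1), score_symmetric(CR_1))  # -> 7.0 4.0
```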
4.5 Word/Concept Relation
Given the algorithm proposed above, a significant number of misclassifications of SRHs would result from cases where an SRH contains a high proportion of function words (having no conceptual mappings in the resulting CR) and only a few content words. Let us consider the following example:
(6) Wo den Informationen zu das gleiche
    Where the information to the same
The corresponding CR consists of a single concept, Information Search Process. ONTOSCORE would classify this CR as coherent with the highest possible score, as there is only one concept in the set. This, however, would often lead to misclassifications. We, therefore, included a post-processing technique that takes into account the relation between the number of ontology concepts N_c in a given CR and the total number of words N_w in the original SRH. This relation is defined by the ratio r = N_c / N_w. ONTOSCORE automatically classifies an SRH as incoherent, irrespective of its semantic coherence score, if r is less than the threshold set. The threshold may be set freely. The corresponding findings are presented in the evaluation section.
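A sketch of this gate (the 0.33 threshold anticipates the best value found in Section 5.2; the function name is ours):

```python
# Sketch of the word/concept gate: an SRH whose ratio of concepts to words
# falls below the threshold is labeled incoherent regardless of its score.
def passes_concept_ratio(n_concepts, n_words, threshold=0.33):
    return n_concepts / n_words >= threshold

# Example (6): one concept out of six words -> r = 0.17, gated out.
print(passes_concept_ratio(1, 6))  # -> False
```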
4.6 ONTOSCORE at Work
Looking at an example of ONTOSCORE at work, we will examine the utterance given in Example (1). The resulting two SRHs, SRH_1 and SRH_2, are given in Examples (1a) and (1b) respectively. The human annotators considered SRH_1 to be coherent and labeled SRH_2 as incoherent. According to the concept entries in the lexicon, the SRHs are transformed into two alternative concept representations. As no ambiguous words are found in this example, CR_1 corresponds to SRH_1 and CR_2 corresponds to SRH_2:

CR_1: {Person; Map; Watch Perceptual Process};
CR_2: {Person; Map; Parting Process}.

They are converted into a graph. According to the algorithm shown in Section 4.3, all paths between the concepts of each graph are calculated and weighted. This yields the following non-Dmax paths:

CR_1: D(Watch Perceptual Process, Person) = 1 via the relation has-watcher;
      D(Watch Perceptual Process, Map) = 1 via the relation has-watchable object.
CR_2: D(Parting Process, Person) = 1 via the relation has-agent.

The ensuing results are:

S(CR_1) = 7      S_s(CR_1) = 4
S(CR_2) = 8.5    S_s(CR_2) = 7
In both cases the results are sufficient for a relative judgment, i.e. SRH_2 constitutes a less semantically coherent structure than SRH_1. To allow for a binary classification into semantically coherent vs. incoherent samples, a cutoff threshold must be set. The results of the corresponding experiments will be presented in Section 5.2.
4.7 Word Sense Disambiguation
Due to lexical ambiguity, the process of transforming an
n-best list of SRH to concept representations often re-
sults in a set of CRs that is greater than 1, i.e. a given
SRH could be transformed into a set of CRs a0 a5 a16 a3 , ...,
a5 a16
a5a42a7 . Word sense disambiguation could, therefore, also
independently be performed using the semantic coher-
ence scoring described herein as an additional application
of our approach. However, that has not been investigated
thoroughly yet.
For example, lexicon entries for the words:

I - Person
am - Static Spatial Process, Self Identification Process, None
on - Two Point Relation, None
the - None
Philosopher's Walk - Location

yield a set of interpretations for an SRH such as:

(7) Ich bin auf dem Philosophenweg
    I am on the Philosopher's Walk

CR_1: {Person, Static Spatial Process, Location}
CR_2: {Person, Static Spatial Process, Two Point Relation, Location}
CR_3: {Person, Self Identification Process, Location}
CR_4: {Person, Self Identification Process, Two Point Relation, Location}
CR_5: {Person, Two Point Relation, Location}
CR_6: {Person, Location}
and corresponding final scores:

S_s(CR_1) = …    S_s(CR_2) = …    S_s(CR_3) = …
S_s(CR_4) = …    S_s(CR_5) = …    S_s(CR_6) = 10 = Dmax
The examination of the resulting scores allows us to conclude that CR_2 constitutes the most semantically coherent representation of the initial SRH, CR_1 and CR_3 display a slightly lesser degree of semantic coherence, whereas CR_4, CR_5 and CR_6 are much less coherent and may, thus, be considered inadequate.
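Enumerating the candidate CRs amounts to a Cartesian product over each word's concept candidates; a sketch following the lexicon of Example (7):

```python
# Sketch: enumerate all candidate CRs of an ambiguous SRH as the Cartesian
# product of per-word concept candidates; None marks a "no concept" reading.
from itertools import product

CANDIDATES = [  # one list per word of Example (7), per its lexicon entries
    ["Person"],                                                       # I
    ["Static Spatial Process", "Self Identification Process", None],  # am
    ["Two Point Relation", None],                                     # on
    [None],                                                           # the
    ["Location"],                                    # Philosopher's Walk
]

crs = {tuple(c for c in choice if c is not None)
       for choice in product(*CANDIDATES)}
print(len(crs))  # -> 6 candidate CRs, as in Example (7)
```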
5 Evaluation
5.1 Context
The ONTOSCORE software runs as a module in
SMARTKOM (Wahlster et al., 2001), a multi-modal and
multi-domain spoken dialogue system. The system fea-
tures the combination of speech and gesture as its input
and output modalities. The domains of the system in-
clude cinema and TV program information, home elec-
tronic device control, mobile services for tourists, e.g.
tour planning and sights information.
ONTOSCORE operates on n-best lists of SRHs pro-
duced by the language interpretation module out of the
ASR word graphs. It computes a numerical ranking of
alternative SRHs and thus provides an important aid to
the understanding component of the system in determin-
ing the best SRH. The ONTOSCORE software employs
two knowledge sources, an ontology (about 730 concepts
and 200 relations) and a word/concept lexicon (ca. 3,600
words), covering the respective domains of the system.
5.2 Results
The evaluation of ONTOSCORE was carried out on a
dataset of 2,284 SRHs. We reformulated the problem of
measuring the semantic coherence in terms of classify-
ing the SRHs into two classes: coherent and incoherent.
To our knowledge, there exists no similar software per-
forming semantic coherence scoring to be used for com-
parison in this evaluation. Therefore, we decided to use
the results from human annotation (see Section 2.2) as the
baseline.
A gold standard for the evaluation of ONTOSCORE was derived by having the annotators agree on the correct solution in cases of disagreement. This way, we obtained 1,246 (54.55%) SRHs classified as coherent by humans, which is also assumed to be the baseline for this evaluation.
Additionally, we performed an inverse linear transformation of the scores (which range from 1 to Dmax), so that the output produced by ONTOSCORE is a score on the scale from 0 to 1, where higher scores indicate greater coherence. In order to obtain a binary classification of SRHs into coherent versus incoherent with respect to the knowledge base, we set a cutoff threshold. The dependency between the threshold value and the classification results (in %) is shown in Figure 2.
Figure 2: Finding the optimal threshold for the coherent
versus incoherent classification
The best results are achieved with the threshold 0.29. With this threshold, ONTOSCORE correctly classifies 1,487 SRHs, i.e. 65.11% of the evaluation dataset (the word/concept relation is not taken into account at this point).
Figure 3 shows the dependency between r, the threshold for the word/concept relation, and the results of ONTOSCORE, given the best cutoff threshold for the classification (i.e. 0.29) derived in the previous experiments.
The best results are achieved with r = 0.33. In other words, the proportion of concepts vs. words must be no less than 1 to 3. Under these settings, ONTOSCORE correctly classifies 1,672 SRHs, i.e. 73.2% of the evaluation dataset. This way, the technique brings an additional improvement of 8.09% as compared to the initial results.
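The exact form of the inverse transformation is not spelled out above; the sketch below assumes a linear map from [1, Dmax] onto [0, 1] together with the reported 0.29 cutoff:

```python
# Sketch of the normalization and binary decision; the linear form of the
# inverse transformation is an assumption, the 0.29 cutoff is the reported
# optimum.
DMAX = 10
CUTOFF = 0.29

def normalize(s):
    """Map raw scores from [1, DMAX] onto [0, 1]; higher = more coherent."""
    return (DMAX - s) / (DMAX - 1)

def classify(s):
    return "coherent" if normalize(s) >= CUTOFF else "incoherent"

print(classify(2.0), classify(9.0))  # -> coherent incoherent
```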
Figure 3: Finding the optimal threshold for the word/concept relation

6 Concluding Remarks

The ONTOSCORE system described herein automatically performs ontology-based scoring of sets of concepts constituting an adequate representation of speech recogni-
tion hypotheses. To date, the algorithm has been implemented in a software module which is employed by a multi-domain and multi-modal dialogue system and applied to the task of scoring n-best lists of SRHs, thus produc-
ing a score expressing how well a given SRH fits within
the domain model. For this task, it provides an alterna-
tive knowledge-based score next to the ones provided by
the ASR and the NLU system. In the evaluation of our
system we employed an ontology that was not designed
for this task, but already existed as the system’s internal
knowledge representation.
As future work we will examine how the computa-
tion of a discourse-dependent semantic coherence score, i.e. how well a given SRH fits within the domain model
with respect to the previous discourse, can improve the
overall score. Additionally, we intend to calculate the
semantic coherence score with respect to individual do-
mains of the system, thus enabling domain recognition
and domain change detection in complex multi-modal
and multi-domain spoken dialogue systems. Currently,
we are also beginning to investigate whether the proposed
method can be applied to scoring sets of potential candi-
dates for resolving the semantic interpretation of ambigu-
ous, polysemous and metonymic language use.
Acknowledgments
This work has been partially funded by the German Fed-
eral Ministry of Research and Technology (BMBF) as
part of the SmartKom project under Grant 01 IL 905C/0
and by the Klaus Tschira Foundation. We would like to
thank Michael Strube for his helpful comments on the
previous versions of this paper.
References
Jan Alexandersson and Tilman Becker. 2003. The For-
mal Foundations Underlying Overlay. In Proceedings
of the Fifth International Workshop on Computational
Semantics (IWCS-5), Tilburg, The Netherlands, Febru-
ary.
James F. Allen, George Ferguson, and Amanda Stent.
2001. An architecture for more realistic conversational systems. In Proceedings of Intelligent User Interfaces,
pages 1–8, Santa Fe, NM.
James F. Allen. 1987. Natural Language Understanding.
Menlo Park, Cal.: Benjamin Cummings.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe.
1998. The Berkeley FrameNet Project. In Proceedings
of COLING-ACL, Montreal, Canada.
Jean Carletta. 1996. Assessing agreement on classifi-
cation tasks: The kappa statistic. Computational Lin-
guistics, 22(2):249–254.
Thomas H. Cormen, Charles E. Leiserson, and Ronald R.
Rivest. 1990. Introduction to Algorithms. MIT press,
Cambridge, MA.
R.V. Cox, C.A. Kamm, L.R. Rabiner, J. Schroeter, and
J.G. Wilpon. 2000. Speech and language processing for next-millennium communications services. Pro-
ceedings of the IEEE, 88(8):1314–1334.
George Demetriou and Eric Atwell. 1994. A seman-
tic network for large vocabulary speech recognition.
In Lindsay Evett and Tony Rose, editors, Proceed-
ings of AISB workshop on Computational Linguistics
for Speech and Handwriting Recognition, University
of Leeds.
Ralf Engel. 2002. SPIN: Language understanding for
spoken dialogue systems using a production system ap-
proach. In Proceedings of ICSLP 2002.
Iryna Gurevych, Robert Porzel, and Michael Strube.
2002. Annotating the semantic consistency of speech
recognition hypotheses. In Proceedings of the Third
SIGdial Workshop on Discourse and Dialogue, pages
46–49, Philadelphia, USA, July.
Diane J. Litman, Marilyn A. Walker, and Michael S.
Kearns. 1999. Automatic detection of poor speech
recognition at the dialogue level. In Proceedings of
the 37th Annual Meeting of the Association for Com-
putational Linguistics, College Park, Md., 20–26 June
1999, pages 309–316.
Martin Oerder and Hermann Ney. 1993. Word
graphs: An efficient interface between continuous-
speech recognition and language understanding. In
ICASSP Volume 2, pages 119–122.
Stuart J. Russell and Peter Norvig. 1995. Artificial In-
telligence. A Modern Approach. Prentice Hall, Engle-
wood Cliffs, N.J.
R. Schwartz and Y. Chow. 1990. The n-best algo-
rithm: an efficient and exact procedure for finding the
n most likely sentence hypotheses. In Proceedings of
ICASSP’90, Albuquerque, USA.
B-H. Tran, F. Seide, V. Steinbiss, R. Schwartz, and
Y. Chow. 1996. A word graph based n-best search
in continuous speech recognition. In Proceedings of
ICSLP’96.
Wolfgang Wahlster, Norbert Reithinger, and Anselm
Blocher. 2001. Smartkom: Multimodal communica-
tion with a life-like character. In Proceedings of the
7th European Conference on Speech Communication
and Technology, pages 1547–1550.
Marilyn A. Walker, Candace A. Kamm, and Diane J. Lit-
man. 2000. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6.
