Towards Measuring Scalability in Natural Language Understanding Tasks
Robert Porzel Rainer Malaka
European Media Laboratory
Schloss-Wolfsbrunnenweg 33
D-69118 Heidelberg, Germany
{robert.porzel,rainer.malaka}@eml-d.villa-bosch.de
Abstract
In this paper we present a discussion of existing metrics for evaluating the performance of individual natural language understanding systems and components as well as the commonly employed metrics for measuring the specific task difficulties. We extend and generalize the common majority class baseline metric and introduce a general entropy-based metric for measuring the task difficulty of arbitrary language understanding tasks. Finally, we show an empirical study evaluating this metric, followed by a discussion of its role in measuring the scalability of language understanding systems and components.
1 Introduction
Current evaluation frameworks for uni- or multi-modal dialogue systems (Walker et al., 2000; Beringer et al., 2002) that allow for spoken language input do not include metrics for measuring the accuracy of the involved intention recognition systems, simply because such information is hard to extract automatically from log files. Furthermore, no general computational method or framework for measuring the difficulty of natural language understanding tasks has been proposed so far. We are, therefore, faced with a lack of methods for measuring the difficulties of the individual tasks involved in the language understanding process. Such generally applicable methods, however, are needed for measuring the scalability of natural language understanding systems and components.
In this paper we first discuss existing metrics for measuring task-specific performances in natural language understanding in Section 2 and the corresponding perplexity and baseline metrics in Section 3. We then propose a generalized baseline-based metric in Section 4.1 as well as a general entropy-based metric in Section 4.2. Both methods can be employed for measuring the difficulties of various understanding tasks and, consequently, for evaluating natural language understanding components involved in the intention recognition process. Section 5 provides a case study evaluation of the proposed methods. In Section 6 we discuss how an analysis of a specific system on tasks differing in their difficulty can yield a first approach for measuring the scalability of a natural language understanding system and its components.
2 Evaluating Dialogue-, Speech- and
Discourse Understanding Systems
In this section we will briefly sketch out the most fre-
quently used metrics for evaluating the performances of
the relevant components and systems at hand.
Evaluation of the Dialogue Systems Performance: For evaluating the overall performance of a dialogue system as a whole, frameworks such as PARADISE (Walker et al., 2000) for unimodal and PROMISE (Beringer et al., 2002) for multimodal systems have set a de facto standard. These frameworks differentiate between:
• dialogue efficiency metrics, i.e. elapsed time, system- and user turns,
• dialogue quality metrics, i.e. mean recognition score and absolute number as well as percentages of timeouts, rejections, helps, cancels, and barge-ins,
• task success metrics, i.e. task completion (per survey),
• user satisfaction metrics (per survey).
These metrics are crucial for evaluating the aggregate performance of the individual components. They cannot, however, determine the amount of understanding versus misunderstanding or the system-specific a priori difficulty of the understanding task. Their importance will nevertheless remain undiminished, as ways of determining such global parameters are vital to determining the aggregate usefulness and felicity of a system as a whole. At the same time, individual components and ensembles thereof - such as the uni- or multi-modal input understanding system - need to be evaluated as well to determine bottlenecks and weak links in the discourse understanding processing chain.
Evaluation of the Automatic Speech Recognition Performance: The commonly used word error rate (WER) can be calculated by aligning any two sets of word sequences and adding the number of substitutions S, deletions D and insertions I. The WER is then given by the following formula, where N is the total number of words in the test set.
$$\mathrm{WER} = \frac{S + D + I}{N} \times 100$$
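As a minimal sketch of the WER computation, the following Python function aligns a reference and a hypothesis word sequence by a standard edit-distance (Levenshtein) alignment, which yields the minimal total S + D + I. The function name `word_error_rate` and the example utterances are our own illustrative assumptions and are not taken from any of the cited systems.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent: (S + D + I) / N * 100 with N = len(reference)."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimal number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n * 100


# Hypothetical utterance pair: one deletion ("me"), one insertion ("to"),
# one substitution ("heidelberg" -> "heidelburg") over N = 6 reference words.
ref = "show me the way to heidelberg".split()
hyp = "show the way to to heidelburg".split()
wer = word_error_rate(ref, hyp)
print(f"WER = {wer:.1f}%")
```

Note that the sketch only returns the aggregate rate; recovering the individual counts S, D and I would require backtracking through the alignment table.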
Another measure of accuracy that is frequently used is the so-called Out Of Vocabulary (OOV) measure, which represents the percentage of words that were not recognized despite their lexical coverage. WER and OOV are commonly combined with the acoustic- and language-model confidence scores, which are constituted by the posterior probabilities of the hidden Markov chains and n-gram frequencies. Together these scores enable evaluators to measure the absolute performance of a given speech recognition system. In order to arrive at a measure that is relative to the given task difficulty, this difficulty must also be calculated, which can be done by measuring the perplexity of the task (see Section 3).
Evaluation of the Natural Language Understanding Performance: A measure for understanding rates - called concept error rate - has been proposed for example by Chotimongcol and Rudnicky (2001) and is designed in analogy to the word error rates employed in automatic speech recognition that are combined with keyword spotting systems. Chotimongcol and Rudnicky (2001) propose to differentiate whether the erroneous concept occurs in a non-concept slot that contains information that is captured in the grammar but not considered relevant for selecting a system action (e.g., politeness markers, such as please), in a value-insensitive slot whose identity suffices to produce a system action (e.g., affirmatives such as yes), or in a value-sensitive slot for which both the occurrence and the value of the slot are important (e.g., a goal object, such as Heidelberg). An alternative proposal for concept error rates is embedded into the speech recognition and intention spotting system by Lumenvox¹, wherein two types of errors and two types of non-errors for concept transcriptions are proposed:
• A match when the application returned the correct concept, and an out of grammar match when the application returned no concepts, or discarded the returned concepts, because the user failed to say any concept covered by the grammar.
• A grammar mismatch when the application returned the incorrect concept, but the user said a concept covered by the grammar, and an out of grammar mismatch when the application returned a concept, and chose that concept as a correct interpretation, but the user did not say a concept covered by the grammar.
¹ www.lomunevox.com/support/tunerhelp/Tuning/Concept Transcription.htm
Neither of these measures is suitable for our purposes, as they are known to be feasible only for context-insensitive applications that do not include discourse models, implicit domain-specific information and other contextual knowledge, as discussed by Porzel et al. (2004). Therefore this measure has also been called keyword recognition rate for single utterance systems. In our view another crucial shortcoming is the lack of comparability, as these measures do not take the general difficulty of the understanding tasks into account. Again, this has been realized in the automatic speech recognition community and led to the so-called perplexity measurements for a given speech recognition task. We will, therefore, sketch out the commonly employed perplexity measurements in Section 3.
The most detailed evaluation scheme for discourse comprehension, introduced by Higashinaka et al. (2002) and extended by Higashinaka et al. (2003), features the metrics given in Table 1.
1. slot accuracy
2. insertion error rate
3. deletion error rate
4. substitution error rate
5. slot error rate
6. update precision
7. update insertion error rate
8. update deletion error rate
9. update substitution error rate
10. speech understanding rate
11. slot accuracy for filled slots
12. deletion error rate for filled slots
13. substitution error rate for filled slots
Table 1: Discourse Comprehension Measurements
These metrics are combined using the results of an m5 multiple linear regression algorithm and a support vector regression approach. The
resulting weighted sum is compared to human intuitions
and PARADISE-like metrics concerning task completion
rates and -times. While this promising approach manages
to combine factors related to speech recognition, interpre-
tation and discourse modeling, there are some shortcom-
ings that stem from the fact that this schema was devel-
oped for single-domain systems that employ frame-based
attribute value pairs for representing the user’s intent.
Recent advances in dialogue management and multi-
domain systems enable approaches that are more flexi-
ble than slot-filling, e.g. using discourse pegs, dialogue
games and overlay operations for handling multiple tasks
and cross-modal references (LuperFoy, 1992; Löckelt et
al., 2002; Pfleger et al., 2002; Alexandersson and Becker,
2003). More importantly - for the topic of this paper - no
means of measuring the a priori discourse understanding
difficulty is given.
Measuring Precision, Recall and F-Measures: In the realm of semantic analyses the task of word sense disambiguation is usually regarded to be among the most difficult ones. This means it can only be solved after all other problems involved in language understanding have been resolved as well. The hierarchical nature and interdependencies of the various tasks are mirrored in the results of the corresponding competitive evaluation tracks, e.g. the Message Understanding Conference (MUC) or the SENSEVAL competition. It becomes obvious that the ungraceful degradation of f-measure scores (shown in Table 2) is due to the fact that each higher-level task inherits the imprecisions and omissions of the previous ones, e.g. errors in the named entity recognition (NE) task cause recall and precision declines in the template element task (TE), which, in turn, thwart successful template relation task performance (TR) as well as the most difficult scenario template (ST) and co-reference (CO) tasks (Marsh and Perzanowski, 1999).
NE        CO        TE        TR        ST
f ≤ .94   f ≤ .62   f ≤ .87   f ≤ .76   f ≤ .51
Table 2: F-measure-based (α = 0.5) evaluation results of the best performing systems of the 7th Message Understanding Conference
Despite several problems stemming from the prerequisite to craft costly gold standards, e.g. tree banks or annotated test corpora, precision and recall and their weighted combinations in the corresponding f-measures (such as given in Table 2) have become a de facto standard for measuring the performance of classification and retrieval tasks (Van Rijsbergen, 1979). Precision p states the percentage of correctly tagged (or classified) entities of all tagged/classified entities, whereas recall r states the percentage of correctly tagged/classified entities as compared to the normative amount, i.e. those that ought to have been tagged or classified. Together these can be combined into an overall f-measure score, defined as:
$$F = \frac{1}{\alpha \frac{1}{p} + (1 - \alpha) \frac{1}{r}}$$
Herein α can be set to reflect the respective importance of p versus r; if α = 0.5 then both are weighted equally. These measures are commonly employed for evaluating part-of-speech tagging, shallow parsing, reference resolution tasks and information retrieval tasks and sub-tasks.
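The following Python sketch illustrates how precision, recall and the α-weighted f-measure defined above can be computed from a gold standard and a system output. It assumes that both are given as sets of (markable, position) pairs; this representation, the function name `precision_recall_f` and the example data are hypothetical choices made for illustration only.

```python
def precision_recall_f(gold, predicted, alpha=0.5):
    """Return (p, r, F) with F = 1 / (alpha / p + (1 - alpha) / r)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    if p == 0.0 or r == 0.0:
        # The weighted harmonic mean is taken to be 0 if either component is 0.
        return p, r, 0.0
    f = 1.0 / (alpha / p + (1.0 - alpha) / r)
    return p, r, f


# Hypothetical gold standard with four tagged entities and a system output
# with three tagged entities, two of which are correct.
gold = {("bank", 3), ("run", 1), ("bank", 9), ("on", 5)}
predicted = {("bank", 3), ("run", 1), ("run", 7)}
p, r, f = precision_recall_f(gold, predicted)
print(f"p = {p:.2f}, r = {r:.2f}, F = {f:.2f}")
```

With α = 0.5 the score reduces to the harmonic mean 2pr / (p + r); values of α above 0.5 weight precision more heavily, values below 0.5 weight recall more heavily.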
An additional problem with this method is that most
natural language understanding systems that perform
deeper semantic analyses produce representations often
based on individual grammar formalisms and mark-up
languages for which no gold standards exist. For evaluat-
ing discourse understanding systems, however, such gold
standards and annotated training corpora will continue to
be needed.
3 Measuring Perplexity and Baselines
In this section we will describe the most frequently used
metrics for estimating the complexity of the tasks per-
formed by the relevant components and systems at hand.
Measuring Perplexity in Automatic Speech Recognition: Perplexity is a measure of the probability-weighted average number of words that may follow after a given word (Hirschman and Thompson, 1997). In order to calculate the perplexity B, the entropy H needs to be given, which is based on the probability P(W) of the word sequences W in the specific language of the system. Entropy and perplexity are then defined as:
$$H = -\sum_{\forall W} P(W) \log_2 P(W)$$
$$B = 2^H$$
Improvements of specific ASR systems can then con-
sequently be measured by keeping the perplexity constant
and measuring WER and OOV performance for recogni-
tion quality and confidence scores for hypothesis verifi-
cation and selection.
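The relation between entropy and perplexity can be illustrated with a short Python sketch. It assumes that the probabilities P(W) are already available as a list of numbers that sum to one; the two distributions below are invented for illustration and do not stem from any real language model.

```python
import math


def entropy(probabilities):
    """H = -sum P(W) * log2 P(W); terms with P(W) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def perplexity(probabilities):
    """B = 2 ** H."""
    return 2 ** entropy(probabilities)


# Hypothetical distributions over four possible continuations.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(perplexity(uniform), perplexity(skewed))
```

For the uniform case the perplexity equals the number of alternatives, whereas any skew towards one alternative lowers it, which is why perplexity can serve as a measure of task difficulty.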
Measuring Task-specific Baselines: Baselines for classification or tagging tasks are commonly defined based on chance performance, on an a posteriori computed majority class performance or on the performance of an established baseline classification method such as naive Bayes, tf*idf or k-means. That means:
• what is the corresponding f-measure if the evaluated component guesses randomly - for chance performance metrics,
• what is the corresponding f-measure if the evaluated component always chooses the most frequent solution - for majority class performance metrics,
• what is the corresponding f-measure of the established baseline classification method.
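The first two baseline types listed above can be sketched in Python as follows. As a simplifying assumption, the sketch reports plain accuracy rather than an f-measure and treats chance performance as uniform guessing among the classes observed in the gold standard; the function names and the label list are hypothetical.

```python
from collections import Counter


def chance_baseline(labels):
    """Expected accuracy of guessing uniformly among the observed classes."""
    return 1.0 / len(set(labels))


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical gold-standard labels for a single ambiguous lexeme.
labels = ["institution", "institution", "building", "shore"]
print(chance_baseline(labels), majority_baseline(labels))
```

Both functions presuppose one shared label set for all items, which is the limitation of conventional majority class baselines when markables have different value sets.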
Much like the kappa statistics proposed by Carletta (1996), existing employments of majority class baselines assume an equal set of identical potential mark-ups, i.e. attributes and their values, for all markables. Therefore, they cannot be used in a straightforward manner for many tasks that involve disjoint sets of attributes and values, in terms of the type and number of attributes and their values involved in the classification task. This, however, is exactly what we find in natural language understanding tasks, such as semantic tagging or word sense disambiguation tasks (Stevenson, 2003). Additionally, baselines computed with other methods cannot serve as a means for measuring scalability because of the circularity involved: one would need a way of measuring the baseline method's scalability factor in the first place. Table 3 provides an overview of the existing ways of measuring performance and task difficulty in automatic speech recognition and understanding.
Domain                            Performance   Complexity
automatic speech recognition      WER/OOV       Perplexity
natural language understanding    CER           none
MUC tasks (NE, TE, TR, ST, CO)    f-measure     baselines
unimodal dialogue system          PARADISE      none
multimodal dialogue system        PROMISE       none
Table 3: Summary of Measurements
4 Measuring Task Difficulty
4.1 Proportional Baseline Rates
Before proposing a metric, we need a clear definition of a natural language understanding task. For this we propose to assume a MATE-like annotation point of view, which provides a set of disjoint levels of annotations for the individual discriminatory decisions that can be performed on spoken dialogue data, ranging from annotating referring expressions, e.g. named entities and their relations, anaphora and their antecedents, to word senses and dialogue acts. Each task must, therefore, have a clearly defined set of markables, attributes and values for each corpus of spoken dialogue data.
As a first step we propose a uniform and generic method for computing task-specific majority class baselines for a given task $T_w$ from the entire set of tasks, i.e. $T = \{T_1, \ldots, T_z\}$ and $T_w \in T$.
A gold standard annotation of a task features a finite set of markable tokens $C = \{c_1, \ldots, c_n\}$ for task $T_w$, e.g. $n = 2$ in a corpus containing only the two ambiguous lexemes bank and run as markables, i.e. $c_1$ and $c_2$ respectively. For a member $c_i$ of the set $C$ we can now define the set of values for the tagging attribute of sense as $A_i = \{b_{i1}, \ldots, b_{i n_i}\}$. For example, for three senses of the markable bank as $c_1$ we get the corresponding value set $A_1 = \{\text{building}, \text{institution}, \text{shore}\}$ and for run as $c_2$ the value set $A_2 = \{\text{motion}, \text{storm}\}$. Note that the value sets have markable-dependent sizes. For our toy example containing the two markables $c_1$ for bank and $c_2$ for run they are:

b_ij   b_i1       b_i2          b_i3
A_1    building   institution   shore
A_2    motion     storm
For computing the proportional majority classes we need to count the occurrences of a value $b_{ij}$ for a markable $c_i$ in a given gold standard test data set. We call this count $V_{ij}$. Now we can determine the number of occurrences of the most frequently given value for each markable $c_i$ as:
$$V^{\max}_i = \max_{j \in \{1, \ldots, n_i\}} V_{ij}$$
For example, given a marked-up toy corpus containing our ambiguous lexemes as shown below as task $T_1$ (the sense tag of each markable is given in square brackets):

The run[storm] on the bank[building] on Monday caused the bank[institution] to collapse early this week. Its employees can therefore now enjoy a leisurely run[motion] on the bank[shore] of the Hudson river. It is uncertain if the bank[institution] can be saved so that they can run[motion] back to their desks and resume their work.

This results in the list of the value occurrences shown below, with $V^{\max}_i$ marked by an asterisk:

V_ij   b_i1   b_i2   b_i3
c_1    1      2*     1
c_2    2*     1
We define the total number of values for a markable $c_i$ as:
$$V^S_i = \sum_{j=1}^{n_i} V_{ij}$$
With $V^{\max}_i$ we define the majority class baseline as:
$$B_i = \frac{V^{\max}_i}{V^S_i}$$
If we always choose the most frequent value for markable $c_i$, the percentage of correct guesses corresponds to $B_i$. We can now calculate the total number of values as:
$$V^S = \sum_{i=1}^{n} V^S_i$$
Based on this we can compute the task-specific proportional baseline for task $T_w$, i.e., $B_{T_w}$, over the entire test set as:
$$B_{T_w} = \frac{1}{V^S} \sum_{i=1}^{n} V^S_i B_i = \frac{1}{V^S} \sum_{i=1}^{n} V^{\max}_i$$
Thus, $B_{T_w}$ calculates the average of correct guesses for the majority baseline. Returning to our toy example, for $c_1$ we get $V^S_1 = 4$ and for $c_2$ we get $V^S_2 = 3$. Additionally, we also get different individual majority class baselines for each markable, i.e., for $c_1$ we get $B_1 = \frac{1}{2}$, and for $c_2$ we get $B_2 = \frac{2}{3}$. We also get a total number of values given for $C$ ($c_1$ and $c_2$), i.e., $V^S = 7$. Now we can compute the overall baseline $B_{T_1}$ as:
$$B_{T_1} = \frac{1}{7}\left(\left(4 \cdot \frac{1}{2}\right) + \left(3 \cdot \frac{2}{3}\right)\right) = \frac{4}{7} \approx 0.57$$
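The computation of the proportional baseline for the toy task can be sketched in Python as follows. The sketch assumes that the gold standard is available as a list of (markable, value) pairs in corpus order; this representation and the function names `value_counts` and `proportional_baseline` are illustrative assumptions.

```python
from collections import Counter, defaultdict


def value_counts(annotations):
    """Count the occurrences V_ij of each value j for each markable i."""
    counts = defaultdict(Counter)
    for markable, value in annotations:
        counts[markable][value] += 1
    return counts


def proportional_baseline(counts):
    """B_T = (sum over markables of V_max_i) / V_S."""
    v_max_total = sum(max(c.values()) for c in counts.values())
    v_s = sum(sum(c.values()) for c in counts.values())
    return v_max_total / v_s


# Sense annotations of the toy corpus for task T1, in order of occurrence.
t1 = [
    ("run", "storm"), ("bank", "building"), ("bank", "institution"),
    ("run", "motion"), ("bank", "shore"), ("bank", "institution"),
    ("run", "motion"),
]
counts_t1 = value_counts(t1)
b_t1 = proportional_baseline(counts_t1)
print(f"B_T1 = {b_t1:.2f}")
```

Because the value counts are kept per markable, the value sets may differ in size and content from one markable to the next.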
If we extend the corpus by an additional ambiguity, i.e. that of the spatial and temporal readings of the lexeme on, we obtain an annotated corpus such as given below as task $T_2$ (sense tags in square brackets):

The run[storm] on the bank[building] on[temporal] Monday caused the bank[institution] to collapse early this week. Its employees can therefore now enjoy a leisurely run[motion] on[spatial] the bank[shore] of the Hudson river. It is uncertain if the bank[institution] can be saved so that they can run[motion] back to their desks and resume their work.

We get the list of the value occurrences shown below:

V_ij   b_i1   b_i2   b_i3
c_1    1      2      1
c_2    2      1
c_3    1      1

Now we can compute the overall baseline $B_{T_2}$ again as:
$$B_{T_2} = \frac{1}{9}\left(\left(4 \cdot \frac{1}{2}\right) + \left(3 \cdot \frac{2}{3}\right) + \left(2 \cdot \frac{1}{2}\right)\right) = \frac{5}{9} \approx 0.56$$
The reduction by roughly .02 points, in this case, indicates that a method that always chooses the most frequently occurring value for each markable would perform slightly worse on the second corpus as compared to the first. Note that this proportional baseline measure is able to compute the performance of such a majority class-based method on any data set for any task. As such it does provide a picture of a problem's or task's inherent difficulty, but only if the distribution of values for the markables at hand is fairly homogeneous. However, if we assume distributions of markable values such as shown below, we get identical values for $B_{T_3}$ and $B_{T_4}$.

T_3   b_i1   b_i2   b_i3   b_i4   b_i5
c_3   16     16     0      0      0

T_4   b_i1   b_i2   b_i3   b_i4   b_i5
c_4   16     4      4      4      4

That is, we get:
$$\frac{1}{32}\left(32 \cdot \frac{1}{2}\right) = \frac{1}{2} = 0.5$$
for both task baselines $B_{T_3}$ and $B_{T_4}$, with $T_3$ featuring the value distribution depicted as $c_3$ and $T_4$ that of $c_4$, despite the fact that task $T_4$ is undoubtedly the more difficult one. To create a more applicable measure for task difficulty - i.e. one that also applies for cases of heterogeneous value distributions - we need to calculate an entropy metric that takes the individual value distributions into account.
4.2 Measuring Markable-specific Entropy
As a means of illustrating such a markable-specific entropy metric we can look at the value space for each markable, define the minimal number of binary decisions that are on average necessary for solving the problem and compute what part of the problem is solved by them. For example, looking at the markable $c_4$ from above we find that the problem can be solved by means of the following decisions: With one decision we can partition the space between $b_{41}$ and the rest ($b_{42}$ through $b_{45}$), thereby assigning the value $b_{41}$ to $c_4$ 16 times. This decision already solves 50% of the problem. Next we need a second decision for partitioning the value space between $b_{42} \wedge b_{43}$ and $b_{44} \wedge b_{45}$ and a third for cutting between $b_{42}$ and $b_{43}$ as well as $b_{44}$ and $b_{45}$ respectively. Therefore, three decisions are needed for assigning the value $b_{42}$ to its 4 occurrences and solving 12.5% of the problem. In the case of $c_4$ the same holds for $b_{43}$, $b_{44}$ and $b_{45}$, giving us the following decision and solution table with $d(b_{ij})$ standing for the number of binary decisions necessary for solving $b_{ij}$:

T_4       b_41   b_42    b_43    b_44    b_45
c_4       16     4       4       4       4
d(b_4j)   1      3       3       3       3
solved    50%    12.5%   12.5%   12.5%   12.5%

Looking at the markable $c_3$ we find that the problem can be solved as shown below:

T_3       b_31   b_32   b_33   b_34   b_35
c_3       16     16     0      0      0
d(b_3j)   1      1      0      0      0
solved    50%    50%    0%     0%     0%
As an illustrative approximation of a task's entropy we can now compute the aggregate number of decisions weighted by their contribution to the overall solution (given as its probability, i.e. 50% = 0.5). For $c_4$ this yields:
$$(1 \cdot 0.5) + (3 \cdot 0.125) + (3 \cdot 0.125) + (3 \cdot 0.125) + (3 \cdot 0.125) = 2$$
And for $c_3$ we get:
$$(1 \cdot 0.5) + (1 \cdot 0.5) = 1$$
In these cases we can now say that solving the markable-specific value distribution of $c_4$ is more difficult than solving that of $c_3$, indicated by the increase of 1 point in this quasi-entropy measure. Note that if we had a binary decision procedure that solves a fraction $f$ of the cases correctly, then we get an average rate of correct solutions of $f^2$ for $T_4$, e.g. $0.9 \cdot 0.9 = 0.81$ for $f = 0.9$, whereas for $T_3$ we get $f$, i.e. $0.9$.
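The decision counts used in this illustration can also be derived mechanically. The following Python sketch does so by building a Huffman tree over the value counts, which is an assumption on our part: the text above constructs the partitions by hand, and Huffman coding is swapped in here only because it yields a minimal average number of binary decisions. The function names are hypothetical.

```python
import heapq
from itertools import count


def decision_depths(counts):
    """Number of binary decisions (Huffman code length) per value; 0 for unused values."""
    depths = [0] * len(counts)
    tiebreak = count()  # unique tiebreaker so that the lists are never compared
    heap = [(c, next(tiebreak), [i]) for i, c in enumerate(counts) if c > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, members1 = heapq.heappop(heap)
        c2, _, members2 = heapq.heappop(heap)
        for i in members1 + members2:
            depths[i] += 1  # each merge adds one decision above these values
        heapq.heappush(heap, (c1 + c2, next(tiebreak), members1 + members2))
    return depths


def average_decisions(counts):
    """Decisions per value weighted by the share of the problem each value solves."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one value must occur")
    return sum(d * c / total for d, c in zip(decision_depths(counts), counts))


c3 = [16, 16, 0, 0, 0]
c4 = [16, 4, 4, 4, 4]
print(decision_depths(c4), average_decisions(c4))
print(decision_depths(c3), average_decisions(c3))
```

For distributions whose probabilities are powers of 1/2, as in both examples, this average coincides with the entropy; for other distributions it is only an upper approximation.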
After this approximate illustration of measuring task difficulty via the notion of its entropy, we can now compute a corresponding markable-specific entropy measure $H_{c_i}$ based on the standard formula, where $P(b_{ij}) = V_{ij} / V^S_i$ is the relative frequency of value $b_{ij}$ for markable $c_i$ and terms with $P(b_{ij}) = 0$ contribute 0:
$$H_{c_i} = -\sum_{j=1}^{n_i} P(b_{ij}) \log_2 P(b_{ij})$$
This computation yields $H_{c_3} = 1$ and $H_{c_4} = 2$, which also reflects the difference in difficulty of $T_3$ (consisting of the sole markable $c_3$) versus $T_4$ (consisting of the sole markable $c_4$).
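A minimal Python sketch of the markable-specific entropy is given below. It assumes that the value counts of a markable are passed as a list of integers and estimates each probability as the relative frequency of the value; the function name `markable_entropy` is our own.

```python
import math


def markable_entropy(value_counts):
    """H_ci = -sum_j P(b_ij) * log2 P(b_ij), with P(b_ij) = V_ij / V_S_i."""
    total = sum(value_counts)
    # Values that never occur are skipped, i.e. 0 * log2(0) is treated as 0.
    probabilities = [v / total for v in value_counts if v > 0]
    return -sum(p * math.log2(p) for p in probabilities)


h_c3 = markable_entropy([16, 16, 0, 0, 0])
h_c4 = markable_entropy([16, 4, 4, 4, 4])
print(h_c3, h_c4)
```

Unlike the proportional baseline, which is 0.5 for both distributions, the entropy separates the two markables.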
4.3 Combining Markable-specific Entropies
We propose to apply an analogous way of combining the individual markable-specific entropies, by a weighted average, whereby the markable-specific weights are determined by $V^S_i$ and the averaging is based on $V^S$. As an example we return to our sample tasks $T_1$ and $T_2$.

V_ij   b_i1   b_i2   b_i3   H_ci     V^S_i   H_ci · V^S_i
c_1    1      2      1      1.5      4       6
c_2    2      1             ≈ 0.92   3       ≈ 2.76
c_3    1      1             1        2       2

Based on this we can define $H_{T_w}$ as follows:
$$H_{T_w} = \frac{\sum_{i=1}^{n} (H_{c_i} \cdot V^S_i)}{V^S}$$
Correspondingly, we get for task $T_1$ consisting of markables $c_1$ and $c_2$ a value $H_{T_1} = (6 + 2.76)/7 \approx 1.25$ and for task $T_2$ consisting of markables $c_1$, $c_2$ and $c_3$ a value $H_{T_2} = (6 + 2.76 + 2)/9 \approx 1.19$. Since the added markable $c_3$ has an entropy of 1, which lies below the weighted average of $c_1$ and $c_2$, the averaged task entropy decreases by about 0.06 points from $T_1$ to $T_2$, while the summed entropy $\sum_i H_{c_i} \cdot V^S_i$ grows from about 8.76 to about 10.76. Now that we have a clearly defined way of measuring task-specific difficulties - based on their markable-specific entropies - we can evaluate our approach by means of a larger experiment described below.
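The weighted average of markable-specific entropies can be sketched in Python as follows, assuming each task is represented as a dictionary that maps a markable name to its list of value counts (a hypothetical representation chosen for brevity). With the counts from the table above and the divisors $V^S = 7$ for $T_1$ and $V^S = 9$ for $T_2$, this sketch evaluates to roughly 1.25 and 1.19 respectively.

```python
import math


def markable_entropy(value_counts):
    """Entropy of one markable's value distribution, in bits."""
    total = sum(value_counts)
    return -sum((v / total) * math.log2(v / total) for v in value_counts if v > 0)


def task_entropy(task):
    """H_T = sum_i (H_ci * V_S_i) / V_S for a dict of markable -> value counts."""
    v_s = sum(sum(values) for values in task.values())
    weighted = sum(markable_entropy(values) * sum(values) for values in task.values())
    return weighted / v_s


t1 = {"c1": [1, 2, 1], "c2": [2, 1]}
t2 = {"c1": [1, 2, 1], "c2": [2, 1], "c3": [1, 1]}
h_t1 = task_entropy(t1)
h_t2 = task_entropy(t2)
print(f"H_T1 = {h_t1:.2f}, H_T2 = {h_t2:.2f}")
```

Because the measure is an average per annotated instance, adding a markable whose entropy lies below the current average lowers the task entropy even though the corpus grows.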
5 Evaluating the Metrics
In the following we report on the results of a corpus study to evaluate the task-specific entropy measurement proposed above. In our view such a study can be performed in the following way: Given a marked-up corpus as an evaluation gold standard, we can alter the corpus' difficulty in three potential ways:
• eliminate parts of the corpus so that the number of values of the individual markables is decreased; we will call this vertical pruning;
• eliminate parts of the corpus so that the number of the individual markables is decreased; we will call this horizontal pruning;
• eliminate parts of the corpus so that both the number of markables and their respective values are reduced; we will call this diagonal pruning.
Since each of these procedures can increase or reduce the overall task difficulty, we can use them to test if our proposed task entropy measure is able to reflect that in a non-toy-world example. For our study we employ the SMARTKOM (Wahlster, 2003) sense-tagged corpus employed in the word sense disambiguation study reported by Loos and Porzel (2004). An overview of the markables and their value distributions is given in Appendix 1.
We can now compute the entropy for the whole task as:
$$H_{T_{whole}} = \frac{1966.18083}{2100} \approx 0.94$$
For the horizontal pruning we removed all markables where a single value accounted for more than 90% of the entire set. Intuitively that makes the task harder because we took out the easy cases, which amounted to about 20% of the entire corpus. We can now compute the entropy for the horizontally pruned task as:
$$H_{T_{horizontal}} = \frac{1718.07959}{1548} \approx 1.11$$
For the vertical pruning we removed all values of $b_{i3}$ from the entire set. Intuitively that makes the task easier because fewer decisions are necessary to solve those markables that had values in $b_{i3}$.
$$H_{T_{vertical}} = \frac{1894.89386}{2073} \approx 0.91$$
For the diagonal pruning we again removed all values of $b_{i3}$ from the entire set, making the task easier, and removed horizontally all markables where the majority class was under 60%, i.e. the hardest cases.
$$H_{T_{diagonal}} = \frac{1729.4254}{1936} \approx 0.89$$
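The three pruning procedures can be sketched in Python on a small invented corpus. The value counts below are hypothetical and are not the SMARTKOM data; the function names, the dictionary representation and the choice to apply the 60% threshold after the vertical step are likewise assumptions made for illustration.

```python
import math


def markable_entropy(values):
    total = sum(values)
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


def task_entropy(task):
    v_s = sum(sum(values) for values in task.values())
    return sum(markable_entropy(v) * sum(v) for v in task.values()) / v_s


def horizontal_prune(task, threshold=0.9):
    """Remove markables whose most frequent value exceeds `threshold` of their occurrences."""
    return {m: v for m, v in task.items() if max(v) / sum(v) <= threshold}


def vertical_prune(task, column=2):
    """Remove the value at index `column` (b_i3 for column=2) from every markable."""
    return {m: v[:column] + v[column + 1:] for m, v in task.items()}


def diagonal_prune(task, column=2, min_majority=0.6):
    """Vertical pruning followed by removing markables with a majority class below `min_majority`."""
    pruned = vertical_prune(task, column)
    return {m: v for m, v in pruned.items() if max(v) / sum(v) >= min_majority}


# Hypothetical value counts for three markables of differing difficulty.
toy = {"easy": [19, 1], "medium": [6, 3, 1], "hard": [4, 3, 3]}
h_whole = task_entropy(toy)
h_horizontal = task_entropy(horizontal_prune(toy))
h_vertical = task_entropy(vertical_prune(toy))
h_diagonal = task_entropy(diagonal_prune(toy))
print(h_whole, h_horizontal, h_vertical, h_diagonal)
```

On this toy data the ordering of the four entropies mirrors the one reported above (horizontal above whole, vertical below whole, diagonal lowest), but this holds for the invented counts only and is not a general guarantee.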
6 Conclusion and Future Work
We have discussed various measures for evaluating the performance of individual components and systems and for estimating the corresponding task complexities. Additionally, we demonstrated the feasibility of employing an entropy-based metric for tasks that are heterogeneously structured in terms of their markable/attribute setup as well as their attribute/value distribution. That means it can be applied to any corpus, even one that features disjoint attributes with value sets of different sizes. In a first study on such a heterogeneous task, we have shown that the results of this generally applicable entropy-based metric correspond to increases and decreases in task difficulty.
In our view this metric for measuring task difficulty can now be employed to approach the question of measuring scalability. Since it is now feasible to manipulate task sizes and difficulties in a controlled and measurable fashion, future experiments and studies can be performed that do almost exactly the opposite of current evaluations of systems and components. That is, instead of keeping the task, i.e. the test corpus, identical and measuring the performance of different methods, we can now keep the method identical and measure its performance on tasks differing in their difficulty. Some open questions still have to be solved, such as evaluating and determining suitable performance measures and formalizing the specific dimensions of scalability that can be measured using this approach, e.g. scalability in terms of performance on problems that are equally difficult but vary in size versus problems that vary in both size and difficulty, to name a few.
References
Jan Alexandersson and Tilman Becker. 2003. The For-
mal Foundations Underlying Overlay. In Proceedings
of the Fifth International Workshop on Computational
Semantics (IWCS-5), Tilburg, The Netherlands, Febru-
ary.
Nicole Beringer, Ute Kartal, Katerina Louka, Florian
Schiel, and Uli Türk. 2002. PROMISE: A Procedure
for Multimodal Interactive System Evaluation. In Proceedings
of the Workshop 'Multimodal Resources and
Multimodal Systems Evaluation', Las Palmas, Spain.
Jean Carletta. 1996. Assessing agreement on classifi-
cation tasks: The kappa statistic. Computational Lin-
guistics, 22(2):249–254.
Ananlada Chotimongcol and Alexander Rudnicky. 2001.
N-best speech hypotheses reordering using linear re-
gression. In Proceedings of Eurospeech, pages 1829–
1832, Aalborg, Denmark.
Ryuichiro Higashinaka, Noboru Miyazaki, Mikio
Nakano, and Kiyoaki Aikawa. 2002. A method
for evaluating incremental utterance understanding
in spoken dialogue systems. In Proceedings of the
International Conference on Speech and Language
Processing 2002, pages 829–833, Denver, USA.
Ryuichiro Higashinaka, Noboru Miyazaki, Mikio
Nakano, and Kiyoaki Aikawa. 2003. Evaluating
discourse understanding in spoken dialogue systems.
In Proceedings of Eurospeech, pages 1941–1944,
Geneva, Switzerland.
Lynette Hirschman and Henry Thompson. 1997.
Overview of evaluation in speech and natural language.
In R Cole, editor, Survey of the State of the Art in
Human Language Technology. Cambridge University
Press, Cambridge.
Markus Löckelt, Tilman Becker, Norbert Pfleger, and Jan
Alexandersson. 2002. Making sense of partial. In
Proceedings of the sixth workshop on the semantics
and pragmatics of dialogue (EDILOG 2002), pages
101–107, Edinburgh, UK, September.
Berenike Loos and Robert Porzel. 2004. Resolution
of lexical ambiguities in spoken dialogue systems. In
Proceedings of the 5th SIGdial Workshop on Discourse
and Dialogue, Boston, USA. Submitted.
Susann LuperFoy. 1992. The representation of multi-
modal user interface dialogues using discourse pegs.
In Proceedings of the 30th Annual Meeting of the Asso-
ciation for Computational Linguistics, Newark, Del.,
28 June – 2 July 1992, pages 22–31.
Elaine Marsh and Dennis Perzanowski. 1999. MUC-7
evaluation of IE technology: Overview of results. In
Proceedings of the 7th Message Understanding Conference.
Morgan Kaufmann Publishers.
Norbert Pfleger, Jan Alexandersson, and Tilman Becker.
2002. Scoring functions for overlay and their ap-
plication in discourse processing. In KONVENS-02,
Saarbrücken, September – October.
Robert Porzel, Iryna Gurevych, and Rainer Malaka.
2004. In context: Integrating domain- and situation-specific
knowledge. In W. Wahlster, editor, SmartKom
- Foundations of Multimodal Dialogue Systems.
Springer, Berlin.
Mark Stevenson. 2003. Word Sense Disambiguation:
The Case for Combining Knowledge Sources. CSLI.
C. J. Van Rijsbergen. 1979. Information Retrieval, 2nd
edition. Dept. of Computer Science, University of
Glasgow.
Wolfgang Wahlster. 2003. SmartKom: Symmetric multimodality
in an adaptive and reusable dialog shell. In
Proceedings of the Human Computer Interaction Status
Conference, Berlin, Germany.
Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman.
2000. Towards developing general models of usability
with PARADISE. Natural Language Engineering,
6.