Combining Acoustic Confidences and Pragmatic Plausibility for Classifying
Spoken Chess Move Instructions
Malte Gabsdil
Department of Computational Linguistics
Saarland University
Germany
gabsdil@coli.uni-sb.de
Abstract
This paper describes a machine learning ap-
proach to classifying n-best speech recogni-
tion hypotheses as either correctly or incor-
rectly recognised. The learners are trained on
a combination of acoustic confidence features
and move evaluation scores in a chess-playing
scenario. The results show significant improve-
ments over sharp baselines that use confidence
rejection thresholds for classification.
1 Introduction
An important task in designing spoken dialogue systems
is to decide whether a system should accept (consider
correctly recognised) or reject (assume misrecognition)
a user utterance. This decision is often based on acoustic
confidence scores computed by the speech recogniser and
a fixed confidence rejection threshold. However, a draw-
back of this approach is that it does not take into account
that in particular dialogue situations some utterances are
pragmatically more plausible than others.1
This paper describes machine learning experiments
that combine acoustic confidences with move evaluation
scores to classify n-best recognition hypothesis of spoken
chess move instructions (e.g. “pawn e2 e4”) as correctly
or incorrectly recognised. Classifying the n-best recog-
nition hypotheses instead of the single-best (e.g. (Walker
et al., 2000)) has the advantage that a correct hypothe-
sis can be accepted even when it is not the highest scor-
ing recognition result. Previous work on n-best hypothe-
sis reordering (e.g. (Chotimongkol and Rudnicky, 2001))
has focused on selecting hypotheses with the lowest rel-
ative word error rate. In contrast, our approach makes
1Although it is possible to use dialogue-state dependent
recognition grammars that reflect expectations of what the user
is likely to say next, these expectations do not say anything
about the plausibility of hypotheses.
predictions about whether hypotheses should be accepted
or rejected. The learning experiments show significantly
improved classification results over competitive baselines
and underline the usefulness of incorporating higher-level
information for utterances classification.
2 Domain and Data Collection
The domain of our research are spoken chess move in-
structions. We chose this scenario as a testbed for our
approach for three main reasons. First, we can use move
evaluation scores computed by a computer chess pro-
gram as a measure of the pragmatic plausibility of hy-
potheses. Second, the domain is simple and allows us
to collect data in a controlled way (e.g. we can control
for player strength), and third, the domain is restricted in
the sense that there is only a finite set of possible legal
moves in every situation. Similar considerations already
let researchers in the 1970s choose chess-playing as an
example scenario for the HEARSAY integrated speech
understanding system (Reddy and Newell, 1974).
We collected spoken chess move instructions in a small
experiment from six pairs of chess players. All sub-
jects were German native speakers and familiar with
the rules of chess. The subject’s task was to re-play
chess games (given to them as graphical representations)
by instructing each other to move pieces on the board.
Altogether, we collected 1978 move instructions under
different experimental conditions (e.g. strong games vs.
weak games) in the following four data sets: 1) language
model, 2) training, 3) development, and 4) test.
The recordings of the language model games were
transcribed and served to construct a context free recog-
nition grammar for the Nuance 8.02 speech recogniser
which was then used to process all other move instruc-
tions with 10-best output.
2http://www.nuance.com. We thank Nuance Inc. for
making their speech recognition software available to us.
3 Baseline Systems
The general aim of our experiments is to decide whether
a recognised move instruction is the one intended by the
speaker. A system should accept correct recognition hy-
potheses and reject incorrect ones. We define the follow-
ing two baseline systems for this binary decision task.
3.1 First Hypothesis Baseline
The first hypothesis baseline uses a confidence rejection
threshold to decide whether the best recognition hypoth-
esis should be accepted or rejected. To find an optimal
value, we linearly vary the confidence threshold returned
by the Nuance 8.0 recogniser (integral values in the range
a0 a1a3a2a5a4a6a1a7a1a9a8 ) and use it to classify the training and develop-
ment data.
The best performing confidence threshold on the com-
bined training and development data was 17 with an accu-
racy of 63.8%. This low confidence threshold turned out
to be equal to the majority class baseline which is to clas-
sify all hypotheses as correctly recognised. In order to
get a more balanced distribution of classification errors,
we also optimised the confidence threshold according to
the cost measure defined in Section 5. According to this
measure, the optimal confidence rejection threshold is 45
with a classification accuracy of 60.5%.3
3.2 First Legal Move Baseline
The first legal move baseline makes use of the constraint
that user utterances only contain moves that are legal in
the current board configuration. We thus first eliminate
all hypotheses that denote illegal moves from the 10-best
output and then apply a confidence rejection threshold to
decide whether the best legal hypothesis should be ac-
cepted or rejected.
The best performing confidence threshold on the com-
bined training and test data for the first legal move base-
line was 23 with an accuracy of 92.4%. This threshold
also optimised the cost measure defined in Section 5. The
performance of both baseline systems on the test data is
reported below in Table 2 together with the results for the
machine learning experiments.
4 ML Experiments
We devise two different machine learning experiments
for selecting hypotheses from the recogniser’s n-best out-
put and from a list of all legal moves given a certain board
configuration.
In Experiment 1, we first filter out all illegal moves
from the n-best recognition results and represent the re-
maining legal moves in terms of 32 dimensional fea-
ture vectors including acoustic confidence scores from
345 is also the default confidence rejection threshold of the
Nuance 8.0 speech recogniser.
the recogniser as well as move evaluation scores from a
computer chess program. We then use machine learners
to decide for each move hypothesis whether it was the
one intended by the speaker. If more than one hypothesis
is classified as correct, we pick the one with the highest
acoustic confidence. If there is no legal move among the
recognition hypotheses or all hypotheses are classified as
incorrect, the input is rejected.
Experiment 2 adds a second classification step to Ex-
periment 1. In case an utterance is rejected in Experiment
1, we try to find the intended move among all legal moves
in the current situation. This is again defined in terms of
a classification problem. All legal moves are represented
by 31 dimensional feature vectors that include “similar-
ity features” with respect to the interpretation of the best
recognition hypothesis and move evaluation scores. Each
move is then classified as either correct or incorrect. We
pick a move if it is the only one that is classified as correct
and all others as incorrect; otherwise the input is rejected.
The average number of legal moves in the development
and training games was 35.3 with a maximum of 61.
4.1 Feature Sets
The feature set for the classification of legal move hy-
potheses in the recogniser’s n-best list (Experiment 1)
consists of 32 features that can be coarsely grouped into
six categories (see below). All features were automati-
cally extracted or computed from the output of the speech
recogniser, move evaluation scores, and game logs.
1. Recognition statistics (3): position in n-best list;
relative position among and total number of legal
moves in n-best list
2. Acoustic confidences (6): overall acoustic confi-
dence; min, max, mean, variance, standard deviation
of individual word confidences
3. Text (1): hypothesis length (in words)
4. Depth1 plausibility (10): raw & normalised move
evaluation score wrt. scores for all legal moves;
score rank; raw score difference to max score;
min, max, mean of raw scores; raw z-score; move
evaluation rank & z-score among n-best legal moves
5. Depth10 plausibility (10): same features as for
depth1 plausibility (at search depth 10)
6. Game (2): ELO (strength) of player; ply number
The feature set for the classification of all legal moves
in Experiment 2 is summarised below. Each move is rep-
resented in terms of 31 (automatically derived) features
which can again be grouped into 6 different categories.
1. Similarity (5): difference size; difference bags;
overlap size; overlap bag
2. Acoustic confidences (6): same as in Experiment 1
for best recognition hypothesis
3. Text (2): length of best recognition hypothesis (in
words) and recognised string (bag of words)
4. Depth1 plausibility (8): same as in Experiment 1
(w/o features relating to n-best legal moves)
5. Depth10 plausibility (8): same as in Experiment 1
(w/o features relating to n-best legal moves)
6. Game (2): same as in Experiment 1
The similarity features are meant to represent how
close a move is to the interpretation of the best recogni-
tion result. The motivation for these features is that the
machine learner might find regularities about what likely
confusions arise in the data. For example, the letters “b”,
“c”, “d”, “e” and “g” are phonemically similar in Ger-
man (as are the letters “a” and “h” and the two digits
“zwei” and “drei”). Although the move representations
are abstractions from the actual verbalisations, the lan-
guage model data showed that most of the subjects re-
ferred to coordinates with single letters and digits and
therefore there is some correspondence between the ab-
stract representations and what was actually said.
4.2 Learners
We considered three different machine learners for
the two classification tasks: the memory-based learner
TiMBL (Daelemans et al., 2002), the rule induction
learner RIPPER (Cohen, 1995), and an implementation
of Support Vector Machines, SVMa0 a1a3a2a5a4a7a6 (Joachims, 1999).
We trained all learners with various parameter settings
on the training data and tested them on the development
data. The best results for the first task (selecting legal
moves from n-best lists) were achieved with SVMa0 a1a3a2a5a4a7a6
whereas RIPPER outperformed the other two learners on
the second task (selecting from all possible legal moves).
SVMa0 a1a8a2a5a4a9a6 and RIPPER where therefore chosen to clas-
sify the test data in the actual experiments.
5 Results and Evaluation
5.1 Cost Measure
We evaluate the task of selecting correct hypotheses with
two different metrics: i) classification accuracy and ii) a
simple cost measure that computes a score for different
classifications on the basis of their confusion matrices.
Table 1 shows how we derived costs from the additional
number of steps (verbal and non-verbal) that have to be
taken in order to carry out a user move instruction. Note
that the cost measure is not validated against user judge-
ments and should therefore only be considered an indica-
tor for the (relative) quality of a classification.
5.2 Results
Table 2 reports the raw classification results for the differ-
ent baselines and machine learning experiments together
Class Cost Sequence
accept correct 0 instruct – move
reject correct/ 2 instruct – reject – instruct –
reject incorrect move
accept incorrect 4 instruct – move – object –
move – instruct – move
Table 1: Cost measure
with their accuracy and associated cost. Here and in sub-
sequent tables, FHa10a12a11 and FHa13a15a14 refer to the first hypoth-
esis baselines with confidence thresholds 17 and 45 re-
spectively, FLM to the first legal move baseline, and Exp1
and Exp2 to Experiments 1 and 2 respectively.
accept reject
FHa10a12a11 (Acc: 61.7% Cost: 1230)
correct 489 0
incorrect 306 3
FHa13a15a14 (Acc: 64.3% Cost: 1188)
correct 441 48
incorrect 237 72
FLM (Acc: 93.5% Cost: 358)
correct 671 0
incorrect 52 75
Exp1 (Acc: 97.2% Cost: 246)
correct 695 2
incorrect 20 81
Exp2 (Acc: 97.2% Cost: 176)
correct 731 1
incorrect 21 45
Table 2: Raw classification results
The most striking result in Table 2 is the huge classifi-
cation improvement between the first hypothesis and the
first legal move baselines. For our domain, this shows a
clear advantage of n-best recognition processing filtered
with “hard” domain constraints (i.e. legal moves) over
single-best processing.
Note that the results for Exp1 and Exp2 in Table 2 are
given “by utterance” (i.e. they do not reflect the classi-
fication performance for individual hypotheses from the
n-best lists and the lists of all legal moves). Note also
that both the different baselines and the machine learning
systems have access to different information sources and
therefore what counts as correctly or incorrectly classi-
fied varies. For example, the gold standard for the first
hypothesis baseline only considers the best recognition
result for each move instruction. If this is not the one in-
tended by the speaker, it counts as incorrect in the gold
standard. On the other hand, the first legal move among
the 10-best recognition hypotheses for the same utterance
might well be the correct one and would therefore count
as correct in the gold standard for the FLM baseline.
5.3 Comparing Classification Systems
We use the a0 a1 test of independence to compute whether
the classification results are significantly different from
each other. Table 3 reports significance results for com-
paring the different classifications of the test data. The
table entries include the differences in cost and the level
of statistical difference between the confusion matrices
as computed by the a0
a1 statistics (
a2a3a2a3a2 denotes significance
at a4a6a5a8a7
a1a7a1 a4 ,
a2a3a2 at a4a9a5a10a7
a1 a4 , and
a2 at a4a9a5a8a7
a1a12a11 ). The table
should be read row by row. For example, the top row
in Table 3 compares the classification from Exp2 to all
other classifications. The value a13 a4a6a1a12a11a15a14 a2a3a2a3a2 means that the
cost compared to FHa10a12a11 is reduced by 1054 and that the
confusion matrices are significantly different ata4a16a5a17a7 a1a7a1 a4 .
FHa18a20a19 FHa21a15a22 FLM Exp1
Exp2 a13
a4a6a1a12a11a23a14
a2a3a2a3a2 a13
a4 a1 a4a25a24
a2a26a2a3a2 a13
a4a25a27a12a24
a2a3a2a3a2 a13a29a28
a1
a2a3a2
Exp1 a13a31a30
a27a23a14
a2a3a2a3a2 a13a31a30
a14a32a24
a2a3a2a3a2 a13
a4 a4a25a24
a2a3a2a3a2
FLM a13 a27 a28 a24 a2a3a2a3a2 a13 a27a12a33a7a1 a2a3a2a3a2
FHa21a15a22 a13 a14a32a24 a2a3a2a26a2
Table 3: Cost comparisons and a0 a1 levels of significance
for all test games
Tables 4 and 5 compare the performance of the differ-
ent systems for strong and weak games (a variable con-
trolled for during data collection).
FHa18a34a19 FHa21a35a22 FLM Exp1
Exp2 a13
a11a37a36a37a27
a2a26a2a3a2 a13
a11a37a36a37a27
a2a26a2a3a2 a13
a4a6a1a12a24
a2a26a2a3a2 a13
a14 a1
a2
Exp1 a13
a11a12a24a23a27
a2a26a2a3a2 a13
a11a12a24a23a27
a2a26a2a3a2 a13
a36a12a24
a2a3a2
FLM a13
a14a32a36a37a36
a2a26a2a3a2 a13
a14a32a36a37a36
a2a26a2a3a2
FHa21a35a22 a38 a1 a2a3a2a3a2
Table 4: Cost comparisons and a0 a1 levels of significance
for strong test games
FHa18a34a19 FHa21a35a22 FLM Exp1
Exp2 a13 a14a32a27a37a36 a2a26a2a3a2 a13 a14a12a14a37a14 a2a26a2a3a2 a13 a27 a1 a2 a13 a33 a1
Exp1 a13 a14a39a11a23a36 a2a26a2a3a2 a13 a14 a4a34a14 a2a26a2a3a2 a13 a11a7a1
FLM a13 a14 a1a37a36 a2a26a2a3a2 a13 a33a12a36a23a14 a2a26a2a3a2
FHa21a35a22 a13 a14a32a24 a2a3a2a3a2
Table 5: Cost comparisons and a0 a1 levels of significance
for weak test games
The results show that the machine learning systems
perform better for the strong test data. We conjecture
that the poorer results for the weak data are due to more
bad moves in these games which receive a low evaluation
score and might therefore be considered incorrect by the
learners.
6 Conclusions
We presented a machine learning approach that combines
acoustic confidence scores with automatic move evalua-
tions for selecting from the n-best speech recognition hy-
potheses in a chess playing scenario and compared the
results to two different baselines.
The chess scenario is well suited for our experiments
because it allowed us to filter out impossible moves and
to use a computer chess program to assess the plausibil-
ity of legal moves. However, the methodology underly-
ing Experiment 1 can be applied to other spoken dialogue
systems to choose interpretation(s) from a recogniser’s n-
best output. We have successfully used this setup for clas-
sifying hypotheses in a command and control spoken di-
alogue system (Gabsdil and Lemon, subm). Experiment
2 exploits the fact that the number of possible interpreta-
tions is finite in the chess scenario. Although this obvi-
ously does not hold for many dialogue tasks, there are ap-
plications such as call routing (e.g. (Walker et al., 2000))
where the number of possible interpretations is limited in
a similar way. Instead of selecting correct interpretations,
we imagine that one could also use the proposed setup to
decide which of a finite set of dialogue moves was per-
formed by a speaker.

References
Ananlada Chotimongkol and Alexander I. Rudnicky.
2001. N-best Speech Hypotheses Reordering Using
Linear Regression. In Proceedings of EuroSpeech-01.
William W. Cohen. 1995. Fast Effective Rule Induction.
In Proceedings of ICML-95.
Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and
Antal van den Bosch. 2002. TIMBL: Tilburg Mem-
mory Based Learner, version 4.2, Reference Guide.
Available from http://ilk.kub.nl/downloads/
pub/papers/ilk0201.ps.gz.
Malte Gabsdil and Oliver Lemon. subm. Combining
acoustic and pragmatic features to predict recognition
performance in spoken dialogue systems. Submitted
to ACL-04.
Thorsten Joachims. 1999. Making Large-Scale SVM
Learning Practical. In B. Schlkopf, C. Burges, and
A. Smola, editors, Advances in Kernel Methods – Sup-
port Vector Learning, pages 41–55. MIT Press.
R. Reddy and A. Newell. 1974. Knowledge and its rep-
resentation in a speech understanding system. In L.W.
Gregg, editor, Knowledge and Cognition.
Marilyn Walker, Jerry Wright, and Irene Langkilde.
2000. Using Natural Language Processing and Dis-
course Features to Identify Understanding Errors in a
Spoken Dialogue System. In Proceedings of ICML-00.
