COMBINING KNOWLEDGE SOURCES 
TO REORDER N-BEST SPEECH HYPOTHESIS LISTS 
Manny Raynez a, David Carter 1, Vassilios Digalakis 2, Patti Price 2 
(1) SRI International, Suite 23, Millers Yard, Cambridge CB2 1RQ, UK 
(2) SRI International, Speech Technology and Research Laboratory, 
333 Ravenswood Ave., Menlo Park, CA 94025-3493, USA 
ABSTRACT 
A simple and general method is described that can combine 
different knowledge sources to reorder N-best lists of hypothe- 
ses produced by a speech recognizer. The method is automat- 
ically trainable, acquiring information from both positive and 
negative examples. In experiments, the method was tested 
on a 1000-utterance sample of unseen ATIS data. 
1. INTRODUCTION 
During the last few years, the previously separate fields of 
speech and natural language processing have moved much 
closer together, and it is now common to see integrated sys- 
tems containing components for both speech recognition and 
language processing. An immediate problem is the nature of 
the interface between the two. A popular solution has been 
the N-best list-for example, \[9\]; for some N, the speech recog- 
nizer hands the language processor the N utterance hypothe- 
ses it considers most plausible. The recognizer chooses the 
hypotheses on the basis of the acoustic information in the in- 
put signal and, usually, a simple language model such as a bi- 
gram grammar. The language processor brings more sophis- 
ticated linguistic knowledge sources to bear, typically some 
form of syntactic and/or semantic analysis, and uses them to 
choose the most plausible member of the N-best list. We will 
call an algorithm that selects a member of the N-best list a 
preference method. The most common preference method is 
to select the highest member of the list that receives a valid 
semantic analysis. We will refer to this as the "highest-in- 
coverage" method. Intuitively, highest-in-coverage seems a 
promising idea. However, practical experience shows that it 
is surprisingly hard to use it to extract concrete gains. For 
example, a recent paper \[8\] concluded that the highest-in- 
coverage candidate was in terms of the word error rate only 
very marginally better than the one the recognizer considered 
best. In view of the considerable computational overhead re- 
quired to perform linguistic analysis on a large number of 
speech hypotheses, its worth is dubious. 
In this paper, we will describe a general strategy for con- 
structing a preference method as a near-optimal combination 
of a number of different knowledge sources. By a "knowledge 
source", we will mean any well-defined procedure that asso- 
ciates some potentially meaningful piece of information with 
a given utterance hypothesis H. Some examples of knowledge 
sources are 
• The plausibility score originally assigned to H by the 
recognizer 
• The sets of surface unigrams, bigrams and trigrams 
present in H 
• Whether or not H receives a well-formed syntac- 
tic/semantic analysis 
• If so, properties of that analysis 
The methods described here were tested on a 1001-utterance 
unseen subset of the ATIS corpus; speech recognition was 
performed using SRI's DECIPHER TM recognizer \[7, 5\], and 
linguistic analysis by a version of the Core Language En- 
gine (CLE \[2\]). For 10-best hypothesis lists, the best method 
yielded proportional reductions of 13% in the word error rate 
and 11% in the sentence error rate; if sentence error was 
scored in the context of the task, the reduction was about 
21%. By contrast, the corresponding figures for the highest- 
in-coverage method were a 7% reduction in word error rate, a 
5% reduction in sentence error rate (strictly measured), and 
a 12% reduction in the sentence error rate in the context of 
the task. 
The rest of the paper is laid out as follows. In Section 2 we 
describe a method that allows different knowledge sources 
to be merged into a near-optimal combination. Section 3 
describes the experimental results in more detail. Section 4 
concludes. 
2. COMBINING KNOWLEDGE 
SOURCES 
Different knowledge sources (KSs) can be combined. We be- 
gin by assuming the existence of a training corpus of N-best 
lists produced by the recognizer, e~h list tagged with a "ref- 
erence sentence" that determines which (if any) of the hy- 
potheses in it was correct. We analyse each hypothesis H 
in the corpus using a set of possible KSs, each of which as- 
sociates some form of information with H. Information can 
be of two different kinds. Some KSs may directly produce a 
number that can be viewed as a measure of H's plausibility. 
Typical examples are the score the recognizer assigned to H, 
and the score for whether or not H received a linguistic anal- 
ysis (1 or 0, respectively). More commonly, however, the KS 
will produce a list of one or more "linguistic items" associated 
with H, for example surface N-grams in H or the grammar 
rules occurring in the best linguistic analysis of H, if there 
was one. A given linguistic item L is associated with a nu- 
merical score through a "discrimination function" (one func- 
tion for each type of linguistic item), which summarizes the 
relative frequencies of occurrence of L in correct and incor- 
rect hypotheses, respectively. Discrimination functions are 
discussed in more detail shortly. The score assigned to H 
217 
by a KS of this kind will be the sum of the discrimination 
scores for all the linguistic items it finds. Thus, each KS will 
eventuMly contribute a numerical score, possibly via a dis- 
crimination function derived from an analysis of the training 
corpus. 
The totM score for each hypothesis is a weighted sum of the 
scores contributed by the various KSs. The final requirement 
is to use the training corpus a second time to compute optimal 
weights for the different KSs. This is an optimization problem 
that can be approximately solved using the method described 
in \[3\] 1 . 
The most interesting role in the above is played by the dis- 
crimination functions. The intent is that linguistic items that 
tend to occur more frequently in correct hypotheses than in- 
correct ones will get positive scores; those which occur more 
frequently in incorrect hypotheses than correct ones will get 
negative scores. To take an example from the ATIS do- 
main, the trigram a list of is frequently misrecognized by 
DECIPHER TM as a list the. Comparing the different hy- 
potheses for various utterances, we discover that if we have 
two distinct hypotheses for the same utterance, one of which 
is correct and the other incorrect, and the hypotheses differ 
by one of them containing a list o\] while the other contains 
a list the, then the hypothesis containing a list o\] is nearly 
always tile correct one. This justifies giving the trigram a list 
o\] a positive score, and the trigram a list the a negative one. 
We now define formally the discrimination function dT for a 
given type T of linguistic item. We start by defining dT as 
a function on linguistic items. As stated above, it is then 
extended in a natural way to a function on hypotheses by 
defining dT(H) for a hypothesis H to be ~ dT(L), where the 
sum is over all the linguistic items L of type T associated 
with H. 
dT(L) for a given linguistic item L is computed as follows. 
(This is a sfight generalization of the method given in \[4\].) 
The training corpus is analyzed, and each hypothesis is 
tagged with its set of associated linguistic items. We then 
find all possible 4-tuples (U, H1, H2, L) where 
• U is an utterance. 
* H1 and H2 are hypotheses for U, exactly one of which 
is correct. 
• L is a linguistic item of type T that is associated with 
exactly one of H1 and H2. 
If L occurs in the correct hypothesis of the pair (Ha, H2), we 
call this a "good" occurrence of L; otherwise, it is a "bad" 
one. Counting occurrences over the whole set, we let g be 
the total number of good occurrences of L, and b be the total 
number of bad occurrences. The discrimination score of type 
T for L, dT(L), is then defined as a function d(g, b). It seems 
sensible to demand that d(g, b) has the following properties: 
• d(g,b)>Oifg>b 
• d(g, b) = -d(b, g) (and hence d(g, b) ---- 0 if g ---- b) 
1A summary can also be found in \[11\]. 
• d(gl,b) > d(g2,b) if ga > g2 
We have experimented with a number of possible such func- 
tions, the best one appearing to be the following: 
log2(2(g-b 1)/(g -\[- b--b 2)) if g < b 
d(g,b)= o il g=b 
-log2(2(b.-t-1)/(g-t-b-.I-2)) if g >b 
This formula is a symmetric, logarithmic transform of the 
function (g + 1)/(g -t- b + 2), which is the expected a pos- 
teriori probability that a new (U, Ha,H2, L) 4-tuple will be 
a good occurrence, assuming that, prior to the quantities g 
and b being known, this probability has a uniform a priori 
distribution on the interval \[0,1\]. 
One serious problem with corpus-based measures like dis- 
crimination functions is data sparseness; for this reason, it 
will often be advantageous to replace the raw linguistic items 
L with equivalence classes of such items, to smooth the data. 
We will discuss this further in Section 3.2. 
3. EXPERIMENTS 
Our experiments tested the general methods that we have 
outlined. 
3.1. Experimental Set-up 
The experiments were run on the 1001-utterance subset of 
the ATIS corpus used for the December 1993 evaluations, 
which was previously unseen data for the purposes of the ex- 
periments. The corpus, originally supplied as waveforms, was 
processed into N-best lists by the DECIPHER TM recognizer. 
The recognizer used a class bigram language model. Each N- 
best hypothesis received a numerical plausibility score; only 
the top 10 hypotheses were retained. The 1-best sentence 
error rate was about 34%, the 5-best error rate (i.e., the 
frequency with which the correct hypothesis was not in the 
top 5) about 19%, and the 10-best error rate about 16%. 
Linguistic processing was performed using a version of the 
Core Language Engine (CLE) customized to the ATIS do- 
main, which was developed under the SRI-SICS-Telia Re- 
search Spoken Language Translator project \[1, 11, 12\]. The 
CLE normally assigns a hypothesis several different possible 
linguistic analyses, scoring each one with a plausibility mea- 
sure. The plausibility measure is highly optimized \[3\], and 
for the ATIS domain has an error rate of about 5%. Only 
the most plausible linguistic analysis was used. 
The general CLE grammar was specialized to the domain 
using the Explanation-Based Learning (EBL) algorithm \[13\] 
and the resulting grammar parsed using an LR parser \[14\], 
giving a decrease in analysis time, compared to the normal 
CLE left-corner parser, of about an order of magnitude. This 
made it possible to impose moderately realistic resource lim- 
its: linguistic analysis was allowed a maximum of 12 CPU 
seconds per hypothesis, running SICStus Prolog on a Sun 
SPARCstation 10/412. Analysis that overran the time limit 
was cut off, and corresponding data replaced by null val- 
ues. Approximately 1.2% of all hypotheses timed out during 
2All product names mentioned in this paper are the trademark 
of their respective holder. 
218 
linguistic analysis; the average analysis time required per hy- 
pothesis was 2.1 seconds. 
Experiments were carried out by first dividing the corpus into 
five approximately equal pools, in such a way that sentences 
from any given speaker were never assigned to more than one 
pool 3 . Each pool was then in turn held out as test data, and 
the other four used as training data.. The fact that utter- 
ances from the same speaker never occurred both as test and 
training data turned out to have an important effect on the 
results, and is discussed in more detail later. 
3.2. Knowledge Sources Used 
The following knowledge sources were used in the experi- 
ments: 
Max. length (words) 
Preference method 8 12 16 o¢ 
1-best 28.3 30.4 31.9 33.7 
Highest-in-coverage 26.3 27.4 30.1 32.2 
.N-gram/highest-in-coverage 26.1 27.1 29.9 131.7 
Recognizer+N-gram 25.3 27.8 29.7 31.6 
Recognizer+linguistic KSs 23.3 24.8 27.9 30.0 
All available KSs 23.5 25.4 28.1 29.9 
Lowest WE in 10-best 12.6 13.2 14.5 15.8 
# utterances in 10-best \[442 710 800 1804031 
# utterances 506 818 936 
Table 1: 10-best sentence error rates 
Recognizer score: The numerical score assigned to each 
hypothesis by the DECIPHER TM recognizer. 
This is typically a large negative integer. 
In coverage: Whether or not the CLE assigned the hy- 
pothesis a linguistic analysis (1 or 0). 
Unlikely grammar construction: 1 if the most plansi- 
ble linguistic analysis assigned to the hypothe- 
sis by the CLE was "unlikely", 0 otherwise. In 
these experiments, the only analyses tagged as 
"unlikely" are ones in which the main verb is a 
form of be, and there is a number mismatch be- 
tween subject and predicate-for example, "what 
is the fares?". 
Class N-gram discriminants (four distinct knowledge 
sources): Discrimination scores for 1-, 2-, 3- and 
4-grams of classes of surface linguistic items. The 
class N-grams are extracted after some surface 
words are grouped into multi-word phrases, and 
some common words and groups are replaced with 
classes; the dummy words *START* and *END* 
are also added to the beginning and end of the 
list, respectively. Thus, for example, the utter- 
ance one way flights to d f w would, after this 
phase of processing, be *START* flight_type_adj 
flights to airport_name *END*. 
Grammar rule discriminants: Discrimination scores for 
the grammar rules used in the most plausible lin- 
guistic analysis of the hypothesis, if there was one. 
Semantic triple diseriminants: 
Discrimination scores for "semantic triples" in 
the most plausible linguistic analysis of the hy- 
pothesis, if there was one. A semantic triple is 
of the form (Head1, Rel, Head2), where Head1 
and Head2 are head-words of phrases, and Rel is 
a grammatical relation obtaining between them. 
Typical values for Rel are "subject" or "object", 
when Head1 is a verb and Head2 the head-word 
of one of its arguments; alternatively, Rel can be 
a preposition, if the relation is a PP modification 
of an NP or VP. There are also some other possi- 
bilities (cf. \[3\]). 
3We would llke to thank Bob Moore for suggesting this idea. 
The knowledge sources naturally fall into three groups. The 
first is the singleton consisting of the "recognizer score" KS; 
the second contains the four class N-gram discriminant KSs; 
the third consists of the remMning "linguistic" KSs. The 
method of \[3\] was used to calculate near-optimal weights for 
three combinations of KSs: 
1. Recognizer score + class N-gram discriminant KSs 
2. Recognizer score + linguistic KSs 
3. All available KSs 
To facilitate comparison, some other methods were tested 
as well. Two variants of the highest-in-coverage method 
provided a lower limit: the "straight" method, and one in 
which the hypotheses were first rescored using the optimized 
combination of recognizer score and N-gram discriminant 
KSs. This is marked in the tables as "N-gram/highest-in- 
coverage", and is roughly the strategy described in \[6\]. An 
upper limit was set by a method that selected the hypothesis 
in the list with the lower number of insertions, deletions and 
substitutions. This is marked as "lowest WE in 10-best". 
3.3. Results 
Table 1 shows the sentence error rates for different preference 
methods and utterance lengths, using 10-best lists; Table 2 
shows the word error rates for each method on the full set. 
The absolute decrease in the sentence error rate between 1- 
best and optimized 10-best with all KSs is from 33.7% to 
29.9%, a proportionM decrease of 11%. This is nearly exactly 
the same as the improvement measured when the lists were 
rescored using a class trigram model, though it should be 
stressed that the present experiments used far less training 
data. The word error rate decreased from 7.5% to 6.4%, a 
13% proportional decrease. Here, however, the trigram model 
performed significantly better, and achieved a reduction of 
22%. 
It is apparent that nearly all of the improvement is com- 
ing from the linguistic KSs; the difference between the lines 
"recognizer + linguistic KSs" and "all available KSs" is not 
significant. Closer inspection of the results also shows that 
the improvement, when evaluated in the context of the spo- 
ken languagetranslation task, is rather greater than Table 1 
219 
Preference method Word Error (%) 
1-best 7.4 
Highest-in-coverage 6.9 
Recognizer+N-gram KSs 6.8 
N-gram/highest-in-coverage 6.7 
Recognizer+linguistic KSs 6.5 
All available KSs 6.4 
Lowest WE in 10-best 3.0 
Table 2: 10-best word error rates 
would appear to indicate. Since the linguistic KSs only look 
at the abstract semantic analyses of the hypotheses, they of- 
ten tend to pick harmless syntactic variants of the reference 
sentence; for example all el the can be substituted for all the 
or what are ... for which are .... When syntactic variants 
of this kind are scored as correct, the figures are as shown 
in Table 3. The improvement in sentence error rate on this 
method of evaluation is from 28.8% to 22.8%, a proportional 
decrease of 21~0. On either type of evaluation, the difference 
between "all available KSs" and any other method except 
"recognizer + linguistic KSs" is significant at the 5% level 
according to the McNemar sign test \[10\]. 
One point of experimental method is interesting enough to 
be worth a diversion. In earlier experiments, reported in the 
notebook version of this paper, we had not separated the data 
in such a way as to ensure that the speakers of the utterances 
in the test and training data were always disjoint. This led 
to results that were both better and also qualitatively differ- 
ent; the N-gram KSs made a much larger contribution, and 
appeared to dominate the linguistic KSs. This presumably 
shows that there are strong surface uniformities between ut- 
terances from at least some of the speakers, which the N-gram 
KSs can capture more easily than the linguistic ones. It is 
possible that the effect is an artifact of the data-collection 
methods, and is wholly or partially caused by users who re- 
peat queries after system misrecognitions. 
For a total of 88 utterances, there was some acceptable 10- 
best hypothesis, but the hypothesis chosen by the method 
Max. length (words) 
Preference method 8 12 16 co 
1-best 24.3 26.0 27.5 28.8 
Highest-in-coverage 20.4 21.5 23.7 25.3 l 
Recognizer+N-gram KSs 20.4 22.5 23.8 25.2 
N-gram/highest-in-coverage 19.0 20.5 22.6 24.1 
Recognizer+linguistic KSs 17.6 19.6 21.7 23.5 
All available KSs i 17.6 19.6 21.5 22.8 
Lowest WE in 10-best 11.3 12.0 13.0 14.0 
# utterances 506 818 936 1001 
Table 3: 10-best sentence error rates counting acceptable 
variants as successes 
Apparently impossible 14 
Coverage problems 44 
Clear preference failure 2i 
Uncertain 9 
Table 4: Causes of N-best preference failure 
that made use of all available KSs was unacceptable. To get a 
more detailed picture of where the preference methods might 
be improved, we inspected these utterances and categorized 
them into different apparent causes of failure. Four main 
classes of failure were considered: 
Apparently impossible: There is no apparent reason to 
prefer the correct hypothesis to the one cho- 
sen without access to intersentential context or 
prosody. There were two main subclasses: either 
some important content word was substituted by 
an equally plausible alternative (e.g. "Minneapo- 
lis" instead of "Indianapolis"), or the utterance 
was so badly malformed that none of the alterna- 
tives seemed plausible. 
Coverage problem: The correct hypothesis was not in 
implemented linguistic coverage, but would prob- 
ably have been chosen if it had been; alternately, 
the selected hypothesis was incorrectly classed as 
" being in linguistic coverage, but would probably 
not have been chosen if it had been correctly clas- 
sifted as ungrammatical. 
Clear preference failure: The information needed to 
make the correct choice appeared intuitively to 
be present, but had not been exploited. 
Uncertain: Other cases. 
The results are summarized in Table 4. 
At present, the best preference method is in effect able to 
identify about 40% of the acceptable new hypotheses pro- 
duced when going from 1-best to 10-best. (In contrast, the 
"highest-in-coverage" method finds only about 20%.) It ap- 
pears that addressing the problems responsible for the last 
three failure categories could potentially improve the pro- 
portion to something between 70% and 90%. Of this in- 
crease, about two-thirds could probably be achieved by suit- 
able improvements to linguistic coverage, and the rest by 
other means. It seems plausible that a fairly substantial pro- 
portion of the failures not due to coverage problems can be 
ascribed to the very small quantity of training data used. 
4. CONCLUSIONS 
A simple and uniform architecture combines different knowl- 
edge sources to create an N-best preference method. The 
method can easily absorb new knowledge sources as they 
become available, and can be automatically trained. It is 
economical with regard to training material, since it makes 
use of both correct and incorrect recognizer hypotheses. It 
is in fact to be noted that over 80% of the discrimination 
220 
scores are negative, deriving from incorrect hypotheses. The 
apparent success of the method can perhaps most simply be 
explained by the fact that it attempts directly to model char- 
acteristic mistakes made by the recognizer. These are often 
idiosyncratic to a particular recognizer (or even to a particu- 
lar version of a recognizer), and will not necessarily be easy 
to detect using more standard language models based on in- 
formation derived from correct utterances only. 
We find the initial results described here encouraging, and in 
the next few months intend to extend them by training on 
larger amounts of data, refining existing knowledge sources, 
and adding new ones. In particular, we plan to investigate 
the possibility of improving the linguistic KSs by using partial 
linguistic analyses when a full analysis is not available. We 
are also experimenting with applying our methods to N-best 
lists that have first been rescored using normal class trigram 
models. Preliminary results indicate a proportional decrease 
of about 7% in the sentence error rate when syntactic vari- 
ants of the reference sentence are counted as correct; this is 
significant according to the McNemar test. Only the finguis- 
tic KSs appear to contribute. We hope to be able to report 
these results in more detail at a later date. 
ACKNOWLEDGEMENT 
The work we have described was accomplished under contract 
to Tefia Research. 
References 
1. AgnEs, M-S., Alshawi, H., Bretan, I., Carter, D.M. 
Ceder, K., Collins, M., Crouch, R., Digalakis, V., 
Ekholm, B., Gamb£ck, B., Kaja, J., Karlgren, J., Ly- 
berg, B., Price, P., Pulman, S., Rayner, M., Samuels- 
son, C. and Svensson, T., Spoken Language Transla- 
tor: First Year Report, joint SRI/SICS technical report, 
1994. 
2. Alshawi, H., The Core Language Engine, Cambridge, 
Massachusetts: The MIT Press, 1992. 
3. Alshawi, H. and Carter, D.M., Training and Scaling 
Preference Functions for Disambiguation, SRI Techni- 
cal Report, 1993. 
4. Collins, M.J. The Use of Semantic Collocations in Pref- 
erence Metrics and Word Similarity Measures, M Phil 
Thesis, Cambridge University, Cambridge, England, 
1993. 
5. Digalakis, V. and Murveit, H., "Genones: Optimizing 
the Degree of Tying in a Large Vocabulary HMM Speech 
Recognizer", Proc. Inter. Conf. on Acoust., Speech and 
Signal Proc., 1994. 
6. Kubala, F., Barry, C., Bates, M., Bobrow, R., Fung, P., 
Ingria, R., Makhoul, J., Nguyen, L., Schwartz, R. and 
Stallard, D., "BBN Byblos and Harc February 1992 
ATIS Benchmark Results", Proe. DARPA Workshop on 
Speech and Natural Language, 199P. 
7. Murveit, H., Butzberger, J., Digalakis, V. and Wein- 
traub, M., "Large Vocabulary Dictation using SRI's 
DECIPHER TM Speech Recognition System: Progres- 
sive Search Techniques", Proc. Inter. Conf. on Acoust., 
Speech and Signal Proc., Minneapolis, Minnesota, April 
1993. 
8. Norton, L.M., Dahl, D.A. and Linebarger, M.C., "Re- 
cent Improvements and Benchmark Results for the Para- 
max ATIS System". Proc. DARPA Workshop on Speech 
and Natural Language, 1992. 
9. Ostendorf, M., et al., "Integration of Diverse Recog- 
nition Methodologies Through Reevaluation of N-best 
Sentence Hypotheses," Proc. DARPA Workshop on 
Speech and Natural Language, 1991. 
10. Powell, F.C., Cambridge Mathematical and Statistical 
Tables, Cambridge University Press, Cambridge, Eng- 
land, 1976. 
11. Rayner, M., Alshawi, H., Bretan, I., Carter, D.M., Di- 
ga\]akis, V., Gamb£ck, B., Kaja, J., Karlgren, J., Ly- 
berg, B., Price, P., Pulman, S. and Samuelsson, C., "A 
Speech to Speech Translation System Built From Stan- 
dard Components". Proc. ARPA workshop on Human 
Language Technology, 1993 
12. Rayner, M., Bretan, I., Carter, D., Collins, M., Di- 
galakis, V., Gamb£ck, B., Kaja, J., Karlgren, J., Ly- 
berg, B., Price, P., Pulman S. and Samuelsson, C., "Spo- 
ken Language Translation with Mid-90's Technology: A 
Case Study". Proc. Eurospeeeh '93, Berlin, 1993. 
13. Samuelsson, C. and Rayner, M., "Quantitative Evalu- 
ation of Explanation-Based Learning as a Tuning Tool 
for a Large-Scale Natural Language System". Proc. 1Pth 
International Joint Conference on Artificial Intelligence. 
Sydney, Australia, 1991. 
14. Samuelsson, C., Fast Natural Language Parsing Using 
Explanation-Based Learning, PhD thesis, Royal Insti- 
tute of Technology, Stockholm, Sweden, 1994. 
221 
