Recent Progress on the SUMMIT System 
Victor Zue, James Glass, David Goodine, Hong Leung, 
Michael Phillips, Joseph Polifroni, and Stephanie Seneff 
Room NE43-601 
Spoken Language Systems Group 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
Introduction 
The s u M M IT system is a speaker-independent, continuous- 
speech recognition system that we have developed at MIT 
\[12\]. To date, the system has been ported to a variety of 
tasks with vocabulary sizes up to 1000 words and perplexi- 
ties up to 73. The architecture of this system is a product 
of two guiding principles. First, we desired a framework that 
could be flexible and modular so that we could explore al- 
ternative strategies for embedding speech knowledge into the 
system. Second, we required that the system be stochastic 
and trainable from a large body of speech data to account 
for our current incomplete knowledge of the acoustic realiza- 
tion of speech. The current implementation of the system 
is a reflection of both of these ideas. SUMMIT differs from 
the majority of prevailing HMM approaches in many respects 
ranging from its use of auditory models and selected acoustic 
measurements, to its segmental framework and use of pro- 
nunciation networks. In time, the specific implementation of 
these ideas will undoubtedly be modified as we discover su- 
perior techniques and approaches. Until phonetic and word 
recognition accuracies are competitive with those of human 
listeners however, we believe it will be appropriate to incor- 
porate both notions of flexibility and trainability into the 
system. 
In the past year we have focused our attention on a larger 
spoken-language effort which integrates SUMMIT with a nat- 
ural language system. We have also investigated issues which 
relate to phonetic recognition; namely alternative segmental 
representations and classification techniques. In addition, we 
have changed our normalization procedure to make it more 
amenable for recording spontaneous speech. In this paper we 
first describe the normalization procedure. We then review 
our segmental framework, and describe two alternatives we 
investigated. Finally, we present some phonetic classification 
and recognition experiments which assess the different seg- 
mental representations and classification techniques. These 
experiments indicated an improvement in classification rate 
of 4%, and in recognition rate of 8%. 
Normalization 
Most speech recognition systems that measure some type 
of absolute value, such as the short-term energy, require that 
an utterance is first normalized before processed. In the SUM- 
MIT system for instance, a weak signal that is not normalized 
will produce auditory outputs which are smaller than the 
spontaneous firing rate and will therefore be indistinguish- 
able from silence. Currently we use a normalization procedure 
that scales a speech waveform with respect to its maximum 
magnitude value. This simple procedure has proven quite 
effective for the majority of speech processing and recogni- 
tion applications since 1) the utterance has typically been 
completely recorded before it is being normalized, 2) any ex- 
traneous loud noises or clicks have been edited from the ut- 
terance so that the largest value in the waveform corresponds 
to speech (usually a vowel) and, 3) the speech intensity has 
been relatively uniform throughout the utterance. 
In situations where a human is interacting with a machine, 
some of these assumptions become less reasonable. From a 
computational perspective, it would be desirable to be able 
to process the signal in near or less than real-time. Another 
difficulty - .~ spontaneous speech is that there are frequently 
spurious noises and clicks which occasionally have the largest 
magnitude in the signal. In this case, an utterance is not 
normalized with respect to speech, so the speech waveform 
values are subsequently weaker than normal. Finally, if the 
speech intensity does change somewhat over the course of the 
utterance, a static scaling value will result in a weaker speech 
signal in some portions of the utterance. 
Normalization Procedure 
The normalization procedure we are investigating uses a 
standard feedback system to compute a gain, g\[n\]. In order 
to provide the gain some time to adapt to changing signal 
conditions, there is a delay of N frames between the point 
where the gain is computed and where the speech signal is 
scaled. Currently, a frame is computed every 25 ms, and the 
delay is 8 frames, or 200 ms. The gain control mechanism is 
illustrated in Figure 1. The system input, x\[n\], is a function 
of the recent short-term energy (computed with 25 ms rect- 
angular window) of the speech signal, sin\]. In order to reduce 
the amount of change in the gain output, the input x\[n\] is 
the maximum energy value within the previous N frames, 
z\[n\] : miaxs\[i \] n -- Y < i < n 
The error signal, e\[n\], is generated by comparing the scaled 
value y\[n\] with a target level, t\[n\]. The scaled value is gen- 
erated from the input and the previous gain value, gin - 1\]. 
All values are in dB. The influence of e\[n\] on the final gain 
is controlled by a sealing parameter, a, which controls the 
influence of the error signal on the gain. 
380 
Figure h Normalization Schematic 
Since the normalization structure assumes nothing about 
the signal to noise ratio, the entire signal is normalized equally 
whether or not the signal corresponds to speech or noise. If 
we could produce a value p,\[n\], corresponding to the proba- 
bility of speech at time n, then we could modify the above 
algorithm. One possible modification would make the target 
level depend on p,\[n\]. When p,\[n\] = 1, then t\[n\] = T, the 
target level for speech. When po\[n\] = 0, then t\[n\] = y\[n\] (i.e., 
e\[n\]= 0). In effect, when the signal is considered to be noise, 
the gain does not change. The net result of this argument is 
to make 
t\[n\] = p,\[n\]T, + (1 - po\[nl)Y\[n\] 
or 
Thus, we make the error proportional to the probability that 
we have speech. Note that another alternative procedure 
would be to reduce the gain. 
In addition to an estimate of the presence of speech in 
the signal, we can improve the performance of the normaliza- 
tion procedure if we can set a limit on the minimum allowable 
signal to noise ratio. This minimum can be used to set an up- 
per limit on the gain value, and can be used for initialization 
purposes. 
Evaluation 
Since the normalization procedure we have described is a 
non-linear operation, it is difficult to know how to quantify its 
performance. We have chosen to use word recognition accu- 
racy as one guideline. Currently, we are incorporating a ver- 
sion of this normalization procedure into the SUMMIT system 
and are evaluating the resulting performance as a function of 
the various parameter values. 
Segmental Representations 
As has been described previously \[12\], the SUMMIT sys- 
tem is based on a segmental framework. During the acoustic- 
phonetic analysis, a phonetic unit is mapped to a segment 
explicitly delineated by a begin and end time in the speech 
signal. Segmental frameworks have been investigated by oth- 
ers \[3,9\] and contrast with the prevailing frame-based struc- 
ture used by most HMM systems sitcom, where sequences of 
observation frames are assumed to be statistically indepen- 
dent from each other. We believe that a segmental framework 
offers us more flexibility than is afforded by a frame-based ap- 
proach, and could ultimately lead to superior modelling of the 
temporal variations in the realization of underlying phonolog- 
ical units. It is for this reason that we base our system on 
such an approach. 
In this section we review the current segmental framework 
of the SUMMIT system and present some alternatives that we 
have investigated. These are discussed again in the following 
section where we present some phonetic recognition results. 
Segmental Formulation 
In our framework, we try to maximize the probability of a 
sequence of units. For phonetic recognition, we represent the 
probability of a sequence of phones, ai = {cqb c~i2, .., O~N} as, 
where N~ is the number of phonetic units in 81, Sj is the jth 
possible sequence of N~ connecting segments traversing the 
utterance, p(c~ik Isjk) is the probability of observing a phone in 
a given segment, and p(sik) is the probability that a segment 
exists. During phonetic recognition, we select the sequence, 
&, that maximizes Equation 1. Thus~ 
p(,~) > p(,~,) vi 
In order to perform recognition, we need to estimate the two 
probability measures in Equation 1. The first term, p(c~ls), is 
a set of a-posteriori classification probabilities which we will 
discuss later in this paper. The second term, p(s), is a set of 
probabilities of valid segments which we will now describe in 
more detail. 
Segment Probabilities 
In order to estimate the segment probabilities, p(s), we 
have formulated segmentation into a boundary classification 
problem. Let {~, b2, .., bh} be the set of possible boundaries 
that might exist within segment, si. We define p(si) to be 
the probability that all internal boundaries do not exist. To 
reduce the complexity of the problem, we assume that all 
boundaries exist independently. Thus, 
p(s~) =p(b~,b2,..,hk) =P(b0P(~)-..P(bk) (2) 
where p(bj) stands for the probability that the jth boundary 
does not exist. Viewing the problem as one of classification, 
the boundary classes can be generated by aligning a phonetic 
transcription with a segment space through recognition or 
some other procedure. 
Search Space Issues 
One of the disadvantages of a segmental framework is that 
the amount of computation associated with a search can be 
significant. If there are N observation boundaries in an ut- 
terance then there are 2 N-2 possible segmentations of these 
N boundaries, and the number of possible segments, N,, is 
N(N-1)/2. Conceptually, we can consider these segments as 
being part of a large segment network which spans from the 
beginning of an utterance to its end. Since phonetic classifi- 
cation is needed in each possible segment, the amount of com- 
putation and the amount of search throughout the segment 
network, can be prohibitively large. For example, if the ut- 
terance is 3 seconds long and the boundaries occur 200 times 
per second (our standard analysis-rate), then No ~ 180,000. 
381 
Several procedures can be adopted to improve the effi- 
ciency of a search involved in a segmental framework. First, 
it is clear that certain search strategies, such as those that 
involve dynamic programming, can reduce the amount of 
search. Currently, we use a modified Viterbi search for pho- 
netic and word recognition \[12\]. In the following paragraphs 
we discuss some of the other procedures we have explored for 
reducing the search space. 
Pruning Boundaries 
If we can determine a subset of boundaries, B, that con- 
tain all boundaries of interest, we can substantially reduce 
the size of the segment network. As we saw previously, the 
number of segments in a full network varies as the square of 
the number of boundaries. In our current system for instance, 
we use a boundary detector that, on average, locates bound- 
aries less than one in every five frames. Thus, the number of 
segments is reduced by more than an order of magnitude. 
Pruning Segments 
There are many alternative tactics that can be used to 
reduce the size of the segment space. For example, Kopec 
and Bushused conservative duration estimates to eliminate 
many candidate segments \[3\]. In our current implementa- 
tion we have used an acoustic basis for determining a set 
of regions, organizing the N boundaries into a hierarchical 
structure called a dendrogram \[2\]. Such a hierarchy produces 
2N-3 possible segments, which reduces the size of a segment 
network by a factor of approximately N 7" 
Distortions 
While eliminating boundaries and segments from the search 
space can substantially reduce the size of the search space, it 
can also increase the amount of distortion involved in match- 
ing a sequence of phonetic units to a segment network. In 
cases where there is no segment to match a phone, we say 
there has been a deletion of a boundary. In cases where there 
is no single segment to match a phone we say there has been 
an insertion of a segment. Where there is an alignment be- 
tween a region and a phone there may be a certain amount of 
distortion involved in the alignment, which ultimately might 
cause a classification error. 
Clearly, if the segment network is pruned there should be 
some mechanism for handling insertions and deletions. As 
the previous paragraphs have pointed out, there are various 
amounts of pruning that can be done, each attaining a cer- 
tain level of phonetic-and word recognition performance. In 
exploring the various possibilities our goal is to understand 
the behavior of the system with different a~nounts of pruning, 
and to maximize the phonetic and word recognition perfor- 
mance. Given these goals, we now describe some of the mod- 
ifications we have made to our pruning strategies. We then 
describe some evaluations we made of alternative segmental 
approaches. 
Boundary Modifications 
Currently in the SUMMIT system the set of boundaries, B, 
that are used are determined by a sensitive edge detector that 
essentially locates local maxima in a spectral derivative func- 
tion \[12\]. This procedure appears to be quite robust since it 
operates on local relative changes in the speech signal. How- 
ever, we have observed that there is a substantial number of 
boundaries located during portions of silence in the speech 
signal. This phenomenon has become more pronounced with 
our emphasis on spontaneous speech because speakers tend 
to false start and hesitate more often than when they are 
reading. As a result, there can be a significant silence period 
and/or non-linguistic sounds at the beginning and the end 
of the sentence. One consequence of this is that the system 
spends a large amount of time analyzing the many segments 
produced during periods of silence. 
Aiming at reducing computational load in different parts 
of the system, we have been investigating techniques to prune 
the silence regions. We have incorporated into SUMMIT a 
simple algorithm which uses scores for speech and silence 
based on the distributions of eight principle components of 
the mean rate response outputs of an auditory model \[11\]. 
We trained the system by collecting histograms of parameter 
distributions for phonetically transcribed utterances from a 
spontaneous-speech database\[13\]. The probability of speech 
is computed on a frame by frame basis, after some tempo- 
ral smoothing. A two-state Markov model contains a-priori 
transition probabilities that are incorporated into the score 
in order to delineate long regions of silence or speech. We are 
in the process of evaluating the effect of this procedure on 
word recognition accuracy. 
Dendrogram Modifications 
In the hierarchical clustering procedure currently used in 
the SUMMIT system we require a metric, D, to compute a 
distance between two adjacent regions. Specifically, let g~ 
be the acoustic vector corresponding to the i th region. Then 
D~ = D(~, a~+l) corresponds to the distance between the i th 
and i + i st regions. In our clustering procedure we merge two 
adjacent regions when, 
Di < min{Di-l,Di+l} 
The current procedure uses a weighted Euclidean distance 
metric. The acoustic vector is an average spectral vector of 
the segment. We have observed two problems that occur in 
the resulting dendrograms. Due to the Euclidean metric, lit- 
tle weight is given to correlation across adjacent channels in 
the spectral representation. Thus, the acoustic structure can- 
not always distinguish similar sounds from those whose spec- 
tral shape is significantly different. Second, local extrema in 
the representation were not adequately reflected in the re- 
sulting dendrogram structure. The combined effects of these 
phenomena was to produce a large phonetic alignment dis- 
tortion for certain sequences of sounds. 
We attempted to address these issues by first translating 
the spectral representation using principle component anal- 
ysis. Each dimension was then normalized by its mean and 
variance. An additional rotation was made based on the aver- 
age within-class variance of each dimension. These variances 
were generated from aligned phonetic transcriptions and were 
intended to equalize the contribution of each component di- 
mension to the overall distance metric. We explored several 
382 
distance metrics using this representation and found that an 
effective one was based on a normalized dot-product, 
which is proportional to a Euclidean distance between the 
normalized acoustic vectors. 
The problem with tl~e local extrema was addressed by in- 
corporating more information into the distance metric. Specif- 
ically, we computed difference vectors, ~ = 5/ - ai+l and 
computed D(~, ~+1). The final distance metric, D*, used a 
combination of both metrics, 
Evaluation 
The evaluation of various segmental representations can 
take many forms. The first analysis we performed aligned a 
phonetic transcription with the segment network. The align- 
ment procedure mapped a phonetic token to the closest re- 
gion that overlapped by at least 50%. If no region was found 
a deletion or insertion was made. Table 1 summarizes the 
statistics of the various representations we have described on 
150 TIMIT utterances. These utterances contained 5636 pho- 
netic tokens. From the Table we see that the modified hi- 
erarchical procedures reduce the insertion rate of the current 
representation by more than one third while the deletion rates 
are slightly reduced. Finally, a full segment network created 
from the entire boundary set B has essentially no insertions. 
The minimum deletion rate is just under 2%. 
Segment 
SUMMIT 
D • 
Y 
Deletion (%) Insertion (%) 
3.5 5.7 
3.5 4.7 
3.0 3.7 
1.9 0 
Table 1: Phonetic alignment performance. 
Acoustic-Phonetic Analysis 
In the previous section we outlined the segmental rep- 
resentation that is being used in the SUMMIT system and 
described some alternatives to the current implementation. 
In this section we describe some experiments we have per- 
formed to explore alternative methods for phonetic classifica- 
tion based on multi-layer perceptrons (MLP). These exper- 
iments involve both phonetic classification and recognition. 
They compare classification techniques as well as segmental 
representations. In the next sections we describe the task, 
and speech corpus used for the experiments, as weU as the 
MLP classifier and data representations that were used. This 
is followed by a description of the phonetic classification and 
finally the phonetic recognition experiments. 
Task and Corpus 
All experiments were based on the sx sentences from 350 
speakers of the TIMIT database \[4\]. 1500 sentences from 300 
speakers were used for training, and 250 sentences from the 
remaining 50 speakers were used for testing. As summarized 
in Table 2, there were a total of 55,000 phonetic tokens in 
the training data and 9,000 tokens in the testing data. There 
were 38 phonetic labels used which represented 14 vowels. 3 
semivowels, 3 nasals, 8 fricatives, 2 affricates, 6 stops, 1 flap. 
and silence. This particular set was chosen because it has 
been used in other evaluations both within and outside our 
research group \[6,12\]. 
Training 
Speakers (M/F) 
300 (216/s4) 
Test Training Test 
Speakers(M/F) Tokens Tokens 
50 (33/17) t 55.000 9.000 
Table 2: Corpus used for the experiments. 
MLP Classifier 
Recently, we have been investigating the use of multi-layer 
perceptrons for phonetic classification. Our work was moti- 
vated by the belief that these networks might offer a flexible 
framework for us to utilize our improved, albeit incomplete, 
speech knowledge. Until recently, our study was performed on 
the constrained task of classifying the 16 vowels in American 
English, spoken by many speakers and excised from continu- 
ous speech \[8\]. Our encouraging results suggested that MLP 
is a promising technique worthy of further investigation. 
We extended our work to one of classifying 38 vowels and 
consonants. In moving to this larger phonetic classification 
problem, we discovered some major problems in training the 
network. In this section, we will describe these problems and 
suggest some procedures to overcome them. 
Initialization 
Without any a priori knowledge, the connection weights 
of MLP are often randomly initialized. Since the transition 
region of the sigmoid function is relatively narrow while the 
saturation regions are relatively wide, randomly initializing 
the network can have the adverse effect of causing most of 
the basic units to operate in the saturation regions of the sig- 
moid function, where learning is slower than in the transition 
region. 
Let z~ = Ej wijxj, where zi is the input to unit i. If we as- 
sume that the weights, w~j, are randomly initialized such that 
they are uncorrelated with zero mean and constant variance. 
It can be shown that \[8\]: 
2 2~E\[x~\]. (3) Eiz,\] = ~ E\[w,ilE\[xj\] = 0 and a,, = a~ 
J J 
Thus although the expected value of the input to the sigmoid 
2 depends on the of a basic unit, E\[z~\], is zero, the variance, a,~ , 
variance of the random weights, 2 a~, as well as the magnitudes 
2 is and the number of dimensions of the input vectors. If a~ 
large, many basic units may operate in the saturation regions. 
Normalization of Inputs 
Several procedures have been suggested to enable the ba- 
sic units to operate initially in the transition regions. By 
initializing the network with small random weights or biasing 
the inputs, a~ can be reduced \[1\]. A method called center 
initialization has also been suggested that guarantees all the 
basic units initially operate at the center of the transition 
region, where learning is fastest \[8\]. 
383 
However, as training proceeds, the connection weights are 
2 depend progressively more changed, resulting in E\[z~\] and cry, 
on the set of input vectors, £. If E\[xj\] and ~jE\[z~\] are 
large, the hidden units may get driven well into the saturation 
regions, hindering the learning capability of the network. 
Let {§} denote the set of training samples, where g = 
\[sl, s~ .... \]t is the speech-vector for each phoneme token. Fur- 
thermore, let 
~j = (~J - si____._), (4) 
~o'3 
where ¢j is the standard deviation of sj, gj is the mean of 
sj over all the training tokens, and 7 is a positive constant. 
Thus, E\[xj\] = E\[z~\] = 0. Assuming "7o'j > 1, 
= < E (51 j "7o'j j 
Thus by subtracting and normalizing the input patterns ac- 
cording to Equation (4), the hidden units may operate more 
often in the transition region of the sigmoid function. 
Adaptive Gain 
Although Equation (4) provides a mechanism to increase 
the learning capability of the network, it requires that ~j and 
a,~ be compute d before the network is trained. In this sec- 
tion, we discuss a different technique to deal with the learning 
capability of the network. 
During training, the connection weights are usually modi- 
fied according to Aw~j = q$ixj, where r/is the gain term, and 
$i is the error signal for unit i. (For simplicity, the momen- 
tum term is ignored in the following analysis.) Thus IAwij\] 
depends on txjh which means the training procedure pays 
more attention to inputs with larger magnitudes than to in- 
puts with smaller magnitudes. 
Alternatively, q can be chosen to be adaptive. Specifically, 
(6) 
= I ,1' i 
where ~0 is a small positive constant. Thus 
I~w,~l = ~--71~,1 Iz~l, and ~ IAw,~l = ~01~,1. (7) 
J 
As a result, the total change in the connection weights to 
a hidden unit is independent of the input, ~, thus allowing 
the training procedure to pay similar attention to all input 
vectors. 
Boundary Classification 
In our segmental framework formulated in Equation i, the 
main difference between classification and recognition is the 
incorporation of a probability for each segment, p(s). As de- 
scribed previously in Equation 2, we have simplified the prob- 
lem of estimating p(s) to one of determining the probability 
that a boundary exists, p(b). Currently in the SUMMIT sys- 
tem, boundary classification is based on the height attained 
by a boundary in the dendrogram. A small VQ codebook of 
size 12 is used to quantize the spectral average of each seg- 
ment, and distributions are collected and parameterized for 
each possible context. 
Since one of the segment networks we considered was not 
based on a dendrogram, an alternative classification proce- 
dure for the boundaries was adopted. In this procedure, a 
MLP with two output units was used, one for the valid bound- 
aries and the other for the extraneous boundaries. By ref- 
erencing the time-aligned phonetic transcription, the desired 
outputs of the network can be determined. In our current im- 
plementation, the probability of a detected boundary, p(b~), is 
determined using four abutting segments. Let ti stand for the 
time at which b~ is located, and si stand for the segment be- 
tween t~ and t~+1, where ti+t > ti. The boundary probability, 
p(bi), is then determined by using the acoustic measurements 
in s,_2,si-l,s~, and si+l as inputs to the MLP. 
Data Representation 
There were two representations used as input for the MLP 
classifier. The first representation was identical to the SUM- 
MIT system, and consisted of 82 acoustic attributes. The 
attributes were generated automatically by a search proce- 
dure that uses the training data to determine the settings of 
free parameters of a set of generic property detectors using an 
optimization procedure\[10\]. The second representation con- 
sisted of a vector of three average spectra which corresponded 
to the left, middle, and right thirds of a segment. Th e spectra 
were the mean-rate and synchrony outputs of a 40 channel au- 
ditory model \[Ii\]. Thus, there were 120 points used for each 
representation. Finally, segment duration was also included. 
Experimental Results 
Phonetic Classification 
The first experiments which were performed were based 
on phonetic classification. In these tests the system classified 
a token taken from a phonetic transcription that had been 
aligned with the speech waveform. Since there was no de- 
tection involved in these experiments only substitution errors 
were possible. As has been reported previously, the baseline 
speaker-independent classification performance of su M MIT on 
the testing data was 70% \[12\]. The performance of the MLP 
classifier using the same input representation was 74%. In 
the second set of classification experiments the representation 
was based on the spectral outputs described previously. Four 
experiments were performed using i) the synchrony outputs, 
2) the mean-rate outputs, 3) the synchrony and mean-rate 
outputs and 4) the synchrony and mean-rate outputs and 
segment duration. The results of all experiments have been 
summarized in Table 3. None of these alternative represen- 
tations was able to achieve the same level of performance as 
the SUMMIT acoustic attributes. 
Boundary Classification 
We have evaluated the MLP boundary classifier using the 
training and testing data described earlier. The inputs to the 
network are the averages of the mean rate response in the four 
abutting segments, resulting in a total of 160 input units. By 
using 32 hidden units, the network can classify 87% of the 
boundaries in the test set correctly. 
384 
