 vIicrophone Arrays and Neural Networks for Robust 
Speech Recognition 
C. Che +, Q. Lin +, J. Pearson*, B. de Vries*, and J. Flanagan + 
+CAIP Center, Rutgers University, Piscataway, NJ 08855-1390 
and 
*David Sarnoff Research Center, Princeton, NJ 08543-5300 
ABSTRACT 
This paper explores use of synergistically-integrated systems 
of microphone arrays and neural networks for robust speech 
recognition in variable acoustic environments, where the user 
must not be encumbered by microphone equipment. Existing 
speech recognizers work best for "high-quality close-talking 
speech." Performance of these recognizers is typically de- 
graded by environmental interference and mismatch in train- 
ing conditions and testing conditions. It is found that use of 
microphone arrays and neural network processors can elevate 
the recognition performance of existing speech recognizers in 
an adverse acoustic environment, thus avoiding the need to 
retrain the recognizer, a complex and tedious task. We also 
present results showing that a system of microphone arrays 
and neural networks can achieve a higher word recognition 
accuracy in an unmatched training/testing condition than 
that obtained with a retrained speech recognizer using array 
speech for both training and testing, i.e., a matched train- 
ing/testing condition. 
1. INTRODUCTION 
Hidden Markov Models (HMM's) have to date been accepted 
as an effective classification method for large vocabulary con- 
tinuous speech recognition. Existing HMM-based recognition 
systems, e.g., SPHINX and DECIPHER, work best for "high- 
quality close-talking speech." They require consistency in 
sound capturing equipment and in acoustic environments be- 
tween training and testing sessions. When testing conditions 
differ from training conditions, performance of these recog- 
nizers is typically degraded if they are not retrained to cope 
with new environmental effects. 
Retraining of HMM-based recognizers is complex and time- 
consuming. It requires recollection of a large amount of 
speech data under corresponding conditions and reestimation 
of HMM's parameters. Particularly great time and effort are 
needed to retrain a recognizer which operates in a speaker- 
independent mode, which is the mode of greatest general in- 
terest. 
Room reverberation and ambient noise also degrade per- 
formance of speech recognizers. The degradation becomes 
more prominent as the microphone is positioned more dis- 
rant from the speaker, for instance, in a teleconferenc- 
ing application. Previous work has demonstrated that 
beamforming/matched-filter microphone arrays can provide 
higher signal-to-noise ratios than can conventional micro- 
phones used at distances (see, e.g., \[1, 2\]). Consequently, 
there is increasing interest in microphone arrays for hands- 
free operation of speech processing systems \[3\]-\[7\]. 
In this report, a system of microphone arrays and neural net- 
works is described which expands the power and advantages 
of existing ARPA speech recognizers to practical acoustic en- 
vironments where users need not be encumbered by hand- 
held or body-worn microphone systems. (Examples include 
Combat Information Centers, large group conferences, and 
mobile hands-busy eyes~busy maintenance tasks.) Another 
advantage of the system is that the speech recognizer need 
not be retrained for each particular application environment. 
Through neural network computing, the system learns and 
compensates for environmental interference. The neural net- 
work transforms speech-feature data (such as cepstrum co- 
efflcients) obtained from a distant-talking microphone array 
to those corresponding to a high-quality, close-talking micro- 
phone. The performance of the speech recognizer can thereby 
be elevated in the hostile acoustic environment without re- 
training of the recognizer. 
The remainder of the paper is organized as follows. First, 
a new speech corpus with simultaneous recording from dif- 
ferent microphones is described in Section 2. Next, the 
system of microphone arrays and neural networks is dis- 
cussed in Section 3. The system is evaluated using both the 
SPHINX speech recognizer and a Dynamic-Time-Warping 
(DTW) based recognizer. The results are presented in Sec- 
tions 4 and 5, respectively. In Section 6, performance compar- 
isons are made of different network architectures to identify 
an optimal design for room-acoustic equalization. Finally, we 
summarize the study and discuss our future work in Section 
7. 
2. SPEECH CORPUS 
A speech database has been recently created at the CAIP 
Center for evaluation of the integrated system of microphone 
arrays, neural networks, and ARPA speech recognizers. The 
database consists of 50 male and 30 female speakers. Each 
speaker speaks 20 isolated command-words, 10 digits, and 10 
continuous sentences of the Resource Management task. Of 
the continuous sentences, two are the same for all speakers 
and the remaining 8 sentences are chosen at random. Two 
recording sessions are made for each speaker. One session 
is for simultaneous recording from a head-mounted close- 
talking microphone (HMD 224) and from a 1-D beamforming 
line array microphone (see Section 3.1). The other is for si- 
multaneous recording of the head-mounted close-talking mi- 
crophone and a desk-mounted microphone (PCC 160). The 
recording is done with an Ariel ProPort with a sampling fre- 
342 
(A) 
2 (~. l + I I 
-10000 PIt ' ~ + qlpl+- ........ ; .............. • I I 
0.5 1 1.5 2 
(a) x10 * 
0.5 1 1,5 2 
(C) xlO+ 
'O~I't°00 ' ,l ~ '~"1"~'~'"i,,~ ~ "" t -1 .... , ,,~ t '~mm'; ''+'" .............. ; 
0.5 1 1.5 2 
(D) x 104 
0,5 1 1.5 
x10 ~ 
Figure 1: Speech wavefozms fzom the head-mounted micro- 
phone ( A and O), ~om the 1-D line array microphone (B), 
and from the desk-mounted microphone (D). CA) and (B) 
are simultaneously zecozded in a session and (C) and (D) in 
a following session. The utterance is: "Miczophone arzay," 
spoken by a male speaker !ABF). 
100 oo • /I .ol 
I+wl 
I FEM~JRS EXIRACT~ 
I 
AMPA 
Larlle.focabulaflr 
ma,am., q~,,,,d~ 
m 
.,~.,~b~ / 
(~Sl~LIII ¢~RqC'IIBI/S OF 
CLOSGoTAL~,INQ 91~I~ECH 
m 
)0 ° 
),, |° 
Figuze 2: Block diagram of the robust speech zecognition 
system. The neural network processor is trained using si- 
multaneously recorded speech. The trained neural network 
processor is then used to transform spectral features of array 
input to those appropriate to close-talking. The transformed 
spectral featuzes are inputs to the speech recognition system. 
No retraining or modification of the speech recognizer is nee- 
essary. The training of the neural net typically zequires about 
10 seconds of signal. 
corporating microphone arrays, neural networks, and ARPA 
speech recognizers. 
quency of 16 kHz and 16-bit linear quantization. The record- 
ing environment is a hard-walled laboratory room of 6 × 6 × 2.7 
meters, having a reverberation time of approximately 0.5 sec- 
ond. Both the desk-mounted microphone and the line axray 
microphone are placed 3 meters from the subjects. Ambient 
noise in the laboratory room is from several workstations, 
fans, and a large-size video display equipment for telecon- 
fereneing. The measured 'A' scale sound pressure level is 
50 dB. Indicative of the quality differences in outputs f~om 
various sound pickup systems, signal waveforms are given 
in Figure 1. Because of wave propagation from the speaker 
to distant microphones, a delay of approximately 9 msec is 
noticed in outputs of the line array and the desk-mounted 
microphone. Wave propagation between the subject's lips 
to the head-mounted close-talking microphone is negligible. 
The reader is referred to \[8\] for more details. 
3. SYSTEM OF MICROPHONE 
ARRAYS AND NEURAL 
NETWORKS 
Figure 2 schematically shows the overall system design for ro- 
bust speech recognition in variable acoustic environments, in- 
3.1. Beamforming Microphone Arrays 
As the distance between microphones and talker increases, 
the effects of room reverberation and ambient noise be- 
come more prominent. Previous studies have shown that 
beamforming/matched-fllter array microphones are effective 
in counteracting environmental interference. Microphone ar- 
rays can improve sound quality of the captured signal, and 
avoid hand-held, body-worn, or tethered equipment that 
might encumber the talker and restrict movement. 
The microphone array we use here is a one-dimensional beam- 
forming line array. It uses direct-path arrivals to produce a 
slngle-beam delay-and-sum beamformer \[1, 2\]. (The talker 
typically faces the center of the llne array.) The array con- 
sists of 33 omni-direetlonal sensors, which are nonuniformly 
positioned (nested over three octaves). From Figure 1 it is 
seen that the wavefozm of the array resembles that of the 
close-talking microphone more than the desk-mounted mi- 
crophone. 
3.2. Neural Network Processors 
One of the neural network processors we have explored, is 
based on multi-layer perceptrons (MLP). The MLP has 3 lay- 
343 
TRAINING USING THE STANDARD BACKPROPAGATION 
ONE HIDDEN LAYER WITH 4 SIGMOID NEURONS 
I-l • ~rr~.AcTo~ (:zrI.$.~I 
I- i 
Figure 3: A feedforward delay network for mapping the cep- 
stral coefficients of array speech to those of close-talking 
speech. 
ers. The input layer has 9 nodes, covering the current speech 
frame and four preceding and four following frames, as indi- 
cated in Figure 3. There are 4 sigmoid nodes in the hidden 
layer and 1 linear node in the output layer. 13 such MLP's 
are included, with one for each of the 13 cepstrum coefficients 
used in the SPHINX speech recognizer \[14\]. (Refer also to 
Figure 2.) The neural network is trained using a modified 
backpropagation method when microphone-array speech and 
close-talking speech are both available (see Figure 3). 
It is found that 10-seconds of continuous speech material are 
sufficient to train the neural networks and allow them to 
"learn" the acoustic environment. In the present study, the 
neural nets are trained in a speaker-dependent mode; That 
is, 13 different neural networks (one for each cepstrum coeffi- 
cient) are dedicated to each subject 1. The trained networks 
are then utilized to transform cepstrum coefficients of array 
speech to those of close-talking speech, which are then used 
as inputs to the SPHINX speech recognizer. 
4. EVALUATION RESULTS WITH 
SPHINX RECOGNIZER 
As a baseline evaluation, recognition performance is mea- 
sured on the command-word subset of the CAIP database. 
Performance is assessed for matched and unmatched test- 
ing/training conditions and include both the pretrained and 
retrained SPHINX system. 
The results for the pretrained SPHINX are given in Table 1. 
It includes four processing conditions: (i) close-talking; (ii) 
line array; (ili) line array with mean subtraction (MNSUB) 
\[15\]; and, (iv) line array with the neural network processor 
(NN). 
Table 2 gives the results for the retrained SPHINX under 
five processing conditions: (i) close-talking; (fi) line array; 
1 The learning rate is 0.01 and the momentum term is 0.5. The 
tr,~i~i~.g terminates at I000 epocch~s. 
Testing Microphone 
Line-Array 
Line-Array +MNSUB ~3 
Word Accuracy 
88% 
16% 
24% 
82% 
Table 1: Baseline evaluation of recognition performance (% 
correct), using the pretrained SPHINX speech recognizer. 
(iii) desk-mounted microphone; (iv) line array with mean 
subtraction (MNSUB); and, (v) line array with the neural 
network processor (NN). The SPHINX speech recognizer is 
retrained using the CAIP speech corpus to eliminate system 
conditions in coUection of the Resource Management task (on 
which the original SPHINX system has been trained) and the 
CAIP speech database. 
As shown in Tables 1 and 2, the array-neural net system is 
capable of elevating word accuracy of the speech recognizer. 
For the retrained SPHINX, the microphone array and neural 
network system improves word accuracy from 21% to 85% for 
distant talking under reverberant conditions. On the other 
hand, the mean subtraction method under these conditions 
improves the performance only marginally. 
It is also seen from Table 2 that if the SPHINX system has 
been retrained with array speech at a distance of 3 meters, the 
performance is as high as 82%. The figure, obtained under a 
matched training/testing condition, is, however, lower than 
that obtained under an unmatched training/testing condition 
with microphone array and neural network. Similar results 
have been achieved for speaker identification \[9, 10\]. 
5. EVALUATION RESULTS WITH 
DTW RECOGNIZER 
To more effectively and efficiently assess the capability of 
microphone arrays and neural network equalizers, a DTW- 
based speech recognizer is implemented \[12\]. The back end 
of DTW classification is simple, and hence, the results do not 
tend to be influenced by the complex back end of an HMM- 
based recognizer, including language models and word-pair 
grammars. 
Testing Training Training 
Close-Talking Line-Array 
Close-Talking 
Line-Array 
Desk-mounted 
Line-Array + MNSUB 
Line-Array + NN 
95% 
21% 
13% 
27% 
85% 
82% 
Table 2: Baseline evaluation of recognition performance (% 
correct), using a retrained SPHINX recognizer based on the 
CAIP speech database. 
344 
11111 
8C 
<6C L2 
5C 
3C , 1000 2000 3000 4000 5000 
Number of Iterations (Epochs) 
Figure 4: Word recognition accuracy on the testing set as a 
function of the number of iterations when training the neural 
network processor. 
The DTW recognizer is applied to recognition of the com- 
mand words. End-points of close-talking speech are automat- 
ically determined by the two-level approach \[11\] 2. Attempts 
have also been made to automatically detect end-points of ar- 
ray speech \[13\], but in the present paper, the starting/ending 
points are inferred from the simultaneously recorded close- 
talking speech, with an additional delay resulting from wave 
propagation. The DTW recognizer is speaker dependent, 
and is trained using close-talking speech. The measured fea- 
tures are 12th-order LPC-derived cepstral coefficients over a 
frame of 16 msec. The frame is Hamming-windowed and the 
consecutive windows overlap by 8 msec. The DTW recog- 
nizer is tested on array speech (with the originally computed 
and neural-network corrected cepstral coefficients) and on the 
other set of the close-talking recording. The Euclidean dis- 
tance is utilized as the distortion measure. 
The recognition results, pooled over 10 male speakers, are 
presented in Table 3. The configuration of MLP used in this 
DTW based evaluation differs from that in Section 4. A single 
MLP with no window-sliding is now used to collectively trans- 
form all of 12 cepstral coefficients from array speech to cclose- 
talking. The MLP has 40 hidden nodes and 12 output nodes. 
The network is again speaker-dependently trained with stan- 
dard backpropagation algorithms. The learning rate is set to 
0.1 and the momentum term to 0.5. The backpropagation 
procedure terminates after 5000 iterations (epochs). 
It can be seen that the results in Table 3 are similar to those 
in Tables 2 and 1. The use of microphone arrays and neural 
networks elevates the DTW word accuracy from 34% to 94% 
under reverberant conditions. The elevated accuracy is close 
to that obtained for close-talking speech (98~0). 
2The automatic results conform well with manual editing. 
Close-Talking 
Line-Array 
Line-Array + NN 
Word Accuracy 
98% 
34% 
94% 
Table 3: Baseline evaluation of recognition performance using 
DTW classification algorithms. 
Figure 4 illustrates the relationship between the number of 
training iterations of the neural networks and the word recog- 
nition accuracies. It is seen that as the iteration number in- 
creases from 100 to 1000, the recognition accuracy qnickiy 
rises from 32% to 87%. It can also be seen that after 5000 
iterations the network is not overtralned, since recognition 
accuracy on the testing set is still improving. 
6. PERFORMANCE COMPARISON 
OF DIFFERENT NETWORK 
ARCHITECTURES 
We also perform comparative experiments with respect to 
different network architectures. It has been suggested in the 
communications literature that recurrent non-linear neural 
networks may outperform feedforward networks as equaliz- 
ers. Since our problem can be interpreted as a room acous- 
tics equalization task, we decide to evaluate the performance 
of recurrent nets. For the experiments reported here, we 
only train on data from the 3rd eepstral coefficient (out of 
13 bands). The input to the neural net is the cepstral data 
from the microphone array; the target cepstral coefficient is 
taken from the close-talking microphone. The squared error 
between the target data and the neural net output is used as 
the cost function. The neural nets are trained by gradient de- 
scent. The following three different architectures have been 
evaluated: (i) a linear feedforward net (adaline) \[16\], (i.i) a 
non-linear feedforward net, (iii) and a non-linear recurrent 
network. The input layer of all nets consisted of a tapped de- 
lay line. The network configurations are depicted in Figures 
5 and 6. 
Experimental results are summarized in Table 4, where the 
entry "nflops/epoch" stands for the number of (floating 
point) operations required during training per epoch. The 
entry "#parameters" holds the number of adaptive weights 
in the network. 
It is clear that, for this dataset, the non-linear networks per- 
form better than the linear nets, but at the expense of consid- 
erably more computations during adaptation. This is not a 
problem if we assume that the transfer function from speaker 
to microphone is constant, but in a changing environment 
(moving speaker, doors opening, changing background noise) 
this is a problem, as the neural net needs to track the change 
in real-time. It should be noted that the used cost function, 
the squared error, is in all likelihood not a monotonic func- 
tion of the recognizer performance. Currently experiments 
are underway that evaluate the performance of various net- 
work architectures in terms of word recognition accuracy. 
345 
Figure 5: The feedforward net. The hidden units are non- 
linear (tanh). 
recto'rent 
I ~ ~ I hidden layer 0 
Figure 6: The recurrent network has a similar structure as 
the 2-layer feedforward. 
architecture flnalsqe nflops/epoch #parameters adaptation rule 
no processing .12 
adaliine (I tap) .0952 ', 14,000 1 delta rule 
adaline (5) .0844 - 40,000 (1) 5 delta 
adaline (11) .0825 ', 80,000 (2) 11 delta i 
~dnet(5~,l) .0707 "1924,000(48) I 15 backprop 
recnet (5,2r,1) .0782 - 2478500 (62) 19 bptt 
.0775 '. 3772°000 f94) i ffwdnet ($,4,1) -37n,ooo(94) 29 backprop 
Figure 7: Experimental results of different neural network 
configurations. The various runs are ordered by increasing 
performance. Final sqe (squared error) is the mean sqe per 
time step on the test database. The ops/epoch denotes the 
number of floating point operations per epoch during train- 
ing. The number in brackets is the number of flops per epoch 
divided by flops/epoch for adaline (5 taps). ~ parameters 
denotes the number of adaptive parameters in the network. 
7. CONCLUSION AND 
DISCUSSION 
The above evaluation results suggest that the system of mi- 
crophone array and neural network processors can 
• effectively mitigate environmental acoustic interference 
• without retraining the recognizer, elevate word recog- 
nition accuracies of HMM-based and/or DTW-based 
speech recognizers in variable acoustic environments to 
levels comparable to those obtained for close-talking, 
high-quality speech 
• achieve word recognition accuracies, under unmatched 
training and testing conditions, that exceed those ob- 
tained with a retrained speech recognizer using array 
speech for both retraining and testing, i.e., under em 
matched training and testing conditions 
Similar results have also been achieved for studies on speaker 
recognition \[9, 10\]. 
In future work, we expect to extend the comparative eval- 
uations of different neural network architectures, so that 
the performance of neural network equalization can be ad- 
dressed in terms of word recognition accuracy. We also want 
to extend the evaluation experiments to continuous speech. 
For comparison, the DECIPHER system will be included, 
and possibly other advanced ARPA speech recognizers. The 
CAIP Center has concomitant NSF projects on developing 
2-D and 3-D microphone arrays. These new array micro- 
phones have better spatial volume selectivity and can pro- 
vide a high signal-to-noise ratio. They will be incorporated 
into this study. Further work will compare the system of mi- 
crophone array and neural network with other existing noise 
compensation algorithms, such as Codebook Dependent Cep- 
strum Normalization (CDCN) \[17\] and Parallel Model Com- 
bination (PMC) \[18\]. 
8. ACKNOWLEDGMENT 
This work is supported by ARPA Contract No: DABT63-93- 
C-0037. The work is also in part supported by NSF Grant 
No: MIP-9121541. 
References 
1. Flanagan, J., Berkley~ D, Elko, G., West, J, and 
Sondhi, M., "Autodirective microphone systems," Acus- 
tica 73, 1991, pp. 58-71. 
2. Flanagan, J., Surendran, A., and Jan, E., "Spatially se- 
lective sound capture for speech and audio processing," 
Speech Communication, 13, Nos. 1-2, 1993. pp. 207-222. 
3. Silverman, H. F., "Some analysis of microphone arrays 
for speech data acquisition," IEEE Trans. Acous. Speech 
Signal Processing 35, 1987, pp. 1699-1712. 
4. Berldey, D. A. and Flanagan, J. L., "HuMaNet: An 
experimental human/machine communication network 
based on ISDN," AT ~4 T Tech J., 1990, pp. 87-98. 
5. Che, C., Rahim, M, and Flanagan J. "Robust speech 
recognition in a multimedia teleconferencing environ- 
meat," \]. Acous. Soc. Am. 9Z (4), pt 2, p. 2476(A), 
1992. 
346 
6. Sullivan, T. and Stern, R. M., "Multi-microphone 
correlation-based processing for robust speech recogni- 
tion," Proc. ICASSP-93, April, 1993. 
7. Lin, Q., Jan, E., and Flanagan, J., "Microphone-arrays 
and speaker identification," Accepted for publication in 
the special issue on Robust Speech Processing of the 
IEEE Trans. on Speech and Audio Processing, 1994. 
8. Lin, Q., C. Che, and It. Van Dyek: "Description of CAIP 
speech corpus," CAIP Technical Report, Rutgers Uni- 
versity, in preparation, 1994. 
9. Lin, Q., Jan, E., Che, C., and Flanagan, J., "Speaker 
identification in teleconferencing environments using mi- 
crophone arrays and neural networks," Proc. of ESCA 
Workshop on Speaker, Recognition, Identification, and 
Verification, Switzerland, April, 1994. 
10. Lin, Q., Che, C., Jan, E., and Flanagan, J., 
"Speaker/speech recognition using microphone arrays 
and neural networks," paper accepted for the SPIE Con- 
ference, San Diego, July, 1994. 
11. Rabiner, L. and Sambur, M. "An algorithm for deter- 
mining the endpoints of isolated utterances," Bell Syst. 
Tech. J. 54, No. 2, 1975, pp. 297-315. 
12. Sakoe, H. and Chiba, S. "Dynamic programming opti- 
mization for spoken word recognition." IEEE Trans. on 
Acous. Speech Signal Processing P6, 1978, pp. 43-49. 
13. Srivastava, S., Che, C., and Lin, Q., "End-point de- 
tection of microphone-array speech signals," Paper ac- 
cepted for 127th meeting of Acous. Soc. of Amer., 
Boston, June, 1994. 
14. Lee, K.-F., Automatic Speech Recognition: The Develop- 
ment Of The SPHINX System, Kluwer Academic Pub- 
fishers, Boston, 1989. 
15. Furui, S. "Cepstral analysis technique for automatic 
speaker verification," IEEE Trans. Acoustics, Speech 
and Signal Processing, Vol 29, 1979, pp 254--272. 
16. de Vries, B., "Short term memory structures for dy- 
namic neural networks," to appear in Artificial Neural 
Networks with Applications in Speech and Vision (Ed. 
R. Mammone). 
17. Liu, F.-H., Acero, A., and Stern, R. M., "Efficient joint 
compensation of speech for the effect of additive noise 
and linear filtering," ICASSP-92, April, 1992, pp 25% 
260. 
18. Gales, M. J. F. and Young, S. J., "Cepstral parameter 
compensation for HMM recognition," Speech Communi- 
cation 12, July, 1993, pp. 231-239. 
347 
