Antti Iivonen
University of Oulu
Institute of Phonetics
AUTOMATIC RECOGNITION OF SPEECH SOUNDS BY A DIGITAL
COMPUTER
Three contributions concerning the discrimination of
the momentary spectra of some selected Finnish and
German sounds
The main difficulties in speech recognition
may be listed as follows:
1. Which should be the basic linguistic units to be
recognized: sounds (allophones), phonemes, segment
combinations, syllables, words?
2. Should the output text be written orthographically?
If so, how should the problem of the differences between
the phonemic form of an utterance and the orthography
be resolved?
3. If the word is chosen as the basic unit for recognition,
how should one resolve the problem of grammatical
inflection (e.g. in Finnish)?
4. How can the recognition automaton decide where there
is a boundary between two words or two sentences?
5. How can the automaton decide that e.g. the pause
during a long voiceless stop consonant is not a boundary?
6. How can the automaton discriminate the tonal and
chroneme classes in languages in which they are
linguistically relevant?
7. The automaton should not take into account irrelevant
noise; one must also consider the noise produced by
the automaton itself.
8. How can the points in the speech continuum on which
the recognition can be based be localized? Is there one
special acoustic segment (or a momentary spectrum) for
every sound which is characteristic of that sound?
9. It has been shown that segments which are linguistically
identical can be acoustically different. The differences
are due to the following factors:
(1) The same speaker cannot produce two exactly similar
sounds, because the conception of identity is a human
abstraction. (2) Different speakers produce linguistically
the same sound in a different way. (3) Linguistically the
same sound can be modified acoustically by word
prominence, sentence prominence, environment, emotional
factors, speech tempo, dialectal background of the speaker,
speech defects, huskiness, and so on.
10. Linguistically different sounds can be acoustically
similar.
11. Should the phonotactic structures (Sigurd) or the
characteristic sequences (Pike) of a language be taken into
account when creating the recognition program?
12. The technical problems form one great part of
speech recognition. They concern the mechanical
solutions and the recognition program.
1. Vowel recognition based on some selected vowel variables
and discriminant analysis.
The probability of correct identification of the
acoustically close German vowel phonemes /i:, I, e:, ɛ,
y:, and Y/ on the basis of spectrographic input data and
discriminant analysis (references 1, 2, and 3) was
calculated. One male speaker was used. The following
variables were measured: the frequencies of the first four
formants (F1...F4), their amplitudes (L1...L4), the
amplitude of the zero (minimum) point between F1 and F2
(here called LZ1) and that between F2 and F3 (LZ2), and
the duration of the vowels.
The probability of correct identification was 94 per
cent on average. The highest identification probability
was shown by the phoneme /e:/ (98.9 %) and the lowest by
the phoneme /Y/ (85.7 %). The sounds were taken from
sentences read by the informant.
In the real classification procedure, which was connected
to the probabilistic recognition program, 6 identifications
out of 103 were false. The order of significance of the
variables studied, regarding their discriminatory power,
was F2, LZ1, F1, F4, duration, L1, F3, L4, LZ2, L3.
One must take into account the possibility that two
variables which both have good discriminatory power
correlate with each other. In this case the better one is
placed in a high position in the list, but the other one
comes later than its real discriminatory power implies,
because the correlation is taken into account. If the
better variable were not considered, the weaker variable
would perhaps take its place (if the correlation is strong
enough). This may explain the fact that F3 comes after F4
(the correlation of F2 with F3 is strong for the vowels
studied).
The energy minimum between F1 and F2 (LZ1) had good
discriminatory power. This shows that the acoustic signal
can contain cues which are available for automatic
recognition but which need not be relevant for
perception (cf. Tillmann, p. 149).
2. Recognition based on the discrimination of the numerical 
models of sounds. 
In the second experiment the input data of the recognition
program consisted of numerical describers of the sounds.
They were formed by measuring the spectra of the sounds
at constant points. Thus the describer of a sound consisted
of a series of numbers which indicated the amplitude at
constant selected frequencies. The narrow filter (45 Hz
bandwidth) was used in producing the sections which formed
the material measured. 32 measurement points inside the
range of 4 kHz were used.
The describers for 330 Finnish sound manifestations
were calculated. These sounds represented the 8 short
Finnish vowel phonemes and the 3 nasal phonemes
/a, e, i, o, u, y, ä, ö, m, n, ŋ/. 30 representatives of
every phoneme type were taken from sentences read by
a single male speaker.
The data thus obtained were stored and submitted
to discriminant analysis. The measurement points were
treated as variables.
The probability of correct recognition was about
60...70 % on average. One must note, however, that the
localization of the sections was (under the circumstances)
not very exact and the technical equipment was
unfortunately not the best.
3. Recognition based on the numerical models of sounds and
a special recognition program.
In the third recognition experiment an attempt was made
to classify automatically the Finnish nasal sounds
belonging to the phonemes /n/ or /m/ on the basis of the
numerical describers discussed in the preceding chapter.
First, the frequency range of 4 kHz was studied by
means of 33 constant measurement points at intervals of
121 Hz. The 'general' describers for /n/ and /m/ were
calculated by means of PROGRAM I (below).
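The averaging step can be sketched in a few lines. The sketch below is in Python rather than FORTRAN, purely for illustration; the function name mean_describer and the list-of-lists input are assumptions that do not come from the original program, and the end-of-group cards are replaced by passing one group of sections at a time.

```python
from typing import List, Sequence


def mean_describer(sections: Sequence[Sequence[float]]) -> List[float]:
    """Average the amplitudes point by point over all sections of one group.

    Each section is a list of amplitudes measured at the same constant
    frequency points (33 points in the experiment described here).
    """
    if not sections:
        raise ValueError("at least one section is required")
    n_points = len(sections[0])
    if any(len(s) != n_points for s in sections):
        raise ValueError("all sections must have the same number of points")
    # Real (not integer) division, so that fractional means are kept.
    return [sum(s[i] for s in sections) / len(sections) for i in range(n_points)]
```

Called once with the /n/ sections and once with the /m/ sections, such a function would give the two 'general' describers that the later steps compare.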
The basic material consisted of 87 wide band sections
(made with Kay Electric Co. Sound Sona-Graph model 6061-B).
The sections were made from the target point of F2 of the
nasals in single words (all possible environments were
considered). The describers of /n/ and /m/ are presented
graphically in fig. 1. The influence of the environment
on the dental nasals (n) seems not to be very great (fig. 2).
Fig. 1. Models of the /n/ and /m/ phonemes (amplitude in
dB). One male speaker (Finnish); 30 + 57 wide band
sections were used.
Fig. 2. Models of /n/ in different environments: combined,
with a front vowel, and with a back vowel (amplitude in
dB). One male speaker (Finnish); wide band sections were
used.
Secondly the numerical describers were restricted so 
that only nine constant measurement points were considered. 
The nine points with the best discriminatory power were 
sought by means of the PROGRAM II (below). 
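As a hedged illustration of this selection step, the following Python sketch ranks the measurement points by the squared difference between the two mean describers and keeps the k largest. The function name best_points and its parameters are hypothetical; only the criterion (squared difference of the /n/ and /m/ means, largest first) is taken from PROGRAM II.

```python
from typing import List, Sequence


def best_points(mean_n: Sequence[float], mean_m: Sequence[float], k: int = 9) -> List[int]:
    """Return the 1-based indices of the k most discriminating points.

    A point discriminates well when the mean amplitudes of the two
    phonemes differ strongly there, measured by the squared difference.
    """
    if len(mean_n) != len(mean_m):
        raise ValueError("both describers must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(mean_n, mean_m)]
    # sorted() is stable, so equal differences keep their frequency order.
    order = sorted(range(len(squared)), key=lambda i: squared[i], reverse=True)
    return [i + 1 for i in order[:k]]
```

With the 33-point describers of /n/ and /m/ as input and k = 9, the result would correspond to the first nine ordinal numbers printed by PROGRAM II.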
Thirdly, the 'general' numerical models for both
phonemes were calculated on the basis of the nine points
mentioned. The logic of the procedure is described briefly
at the beginning of the program (PROGRAM III).
With the same method the numerical model of a new
nasal sound was calculated (PROGRAM III), and the nasal
sound was classified by comparing its model with the
mean of the models of /n/ and /m/.
The main idea of the classification is that the amplitudes
at the nine measurement points are arranged in order of
magnitude, and then their relative places on the frequency
axis are indicated by means of ordinal numbers (nine
possibilities). The ordinal numbers are then placed one
after another, so that they form one single number. This
number is treated as the numerical model of a group of
nasal sounds or of a single nasal sound.
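The construction of such a model and the comparison with the mean can be sketched as follows. This is a minimal Python sketch, not the original program: the names numerical_model and classify are assumptions, and the decision rule (a model below the mean is printed as M, above it as N) is read from the PROGRAM III listing.

```python
from typing import Sequence


def numerical_model(amplitudes: Sequence[float]) -> int:
    """Concatenate the 1-based positions of the points, greatest amplitude first.

    Works for at most nine points, so that every position is one digit.
    """
    if not 1 <= len(amplitudes) <= 9:
        raise ValueError("between one and nine measurement points are required")
    order = sorted(range(len(amplitudes)), key=lambda i: amplitudes[i], reverse=True)
    model = 0
    for position in order:
        model = model * 10 + (position + 1)
    return model


def classify(new_model: int, model_n: int, model_m: int) -> str:
    """Compare a new model with the mean of the two general models."""
    mean = (model_n + model_m) // 2
    if new_model < mean:
        return "m"
    if new_model > mean:
        return "n"
    return "m or n"
```

For example, the nine amplitudes 12, 15, 9, 20, 3, 7, 18, 1, 5 have their greatest value at the fourth point, the next at the seventh, and so on, which gives the model 472136958. Since the whole decision is one integer comparison, the short classification time reported for this method is plausible.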
The classification time of a sound by means of the method
described here is only a fraction of that needed when
using discriminant analysis.
Final comments 
Every language needs its own recognition program
consisting of subprograms, which can be very different.
That the recognition program can be worked out implies
that there is a sufficient amount of acoustic knowledge
about the language in question.
It is possible that complete speech recognition
will not succeed with the computers available, so that we
must wait until biological computers are at
our disposal. (continued after the programs)
PROGRAM I (programming language FORTRAN II)
C     COMPUTATION OF THE GENERAL MODELS FOR N GROUPS OF
C     SOUNDS: CALCULATE THE MEAN SETS FOR THE GROUPS.
C     MATERIAL CONSISTS OF MEASUREMENT VALUES AT 33
C     CONSTANT MEASUREMENT POINTS ON THE FREQUENCY
C     AXIS OF EVERY SOUND.
C     UNIVERSITY OF OULU, FINLAND
C     INSTITUTE OF PHONETICS
C
      DIMENSION IAMPLI(33),NUMBER(33),ISUM(33)
      DIMENSION AMEAN(33)
      WRITE(3,222)
  222 FORMAT('1',' ')
      IGROUP=0
  401 DO 300 I=1,33
      ISUM(I)=0
  300 NUMBER(I)=0
    1 READ(1,10)(IAMPLI(I),I=1,33)
   10 FORMAT(33I2)
      DO 200 I=1,33
C     A CARD OF 36S ENDS THE PROGRAM, A CARD OF 99S ENDS A GROUP.
C     (ASSUMED: THE TEST FOR 99 AND STATEMENT 4 ARE NOT IN THE
C     LISTING AND ARE SUPPLIED HERE.)
      IF(IAMPLI(I)-36)3,4,3
    3 IF(IAMPLI(I)-99)6,5,6
    6 NUMBER(I)=NUMBER(I)+1
      ISUM(I)=ISUM(I)+IAMPLI(I)
  200 CONTINUE
      GO TO 1
C     MEANS OF THE GROUP (REAL DIVISION KEEPS THE FRACTION)
    5 DO 100 I=1,33
      AMEAN(I)=FLOAT(ISUM(I))/FLOAT(NUMBER(I))
  100 CONTINUE
      IGROUP=IGROUP+1
      WRITE(3,333)IGROUP
  333 FORMAT('0','GROUP',T8,I4)
      WRITE(3,11)(AMEAN(I),I=1,17)
   11 FORMAT(' ','MEANS',T10,17F5.1)
      WRITE(3,12)(AMEAN(I),I=18,33)
   12 FORMAT(' ',T10,16F5.1)
      GO TO 401
    4 STOP
      END
The last card in a group of sounds: 999999999999...99 
The last card in the program: 3636363636...36 
The greatest possible value of variables (IAMPLI): 35 
PROGRAM II
C     SEEK THE NINE BEST DISCRIMINATING POINTS ON THE
C     FREQUENCY AXIS OF THE N AND M SOUNDS. USE THE
C     NUMERICAL DESCRIBERS OF N AND M FORMED BY MEANS OF
C     THE PROGRAM I.
C
      DIMENSION AMEANN(33),AMEANM(33),ASQUAR(33),DIFF(33)
      DIMENSION BSQUAR(33),NUM(33)
C     CALCULATE THE DIFFERENCES OF THE DESCRIBERS OF N
C     AND M.
      :   It is assumed that the describers of /n/
      :   and /m/ are stored before; they are called
      :   AMEANN and AMEANM.
      DO 60 I=1,33
      DIFF(I)=AMEANN(I)-AMEANM(I)
   60 CONTINUE
      DO 61 I=1,33
      ASQUAR(I)=DIFF(I)**2
   61 CONTINUE
C     SET THE AMPLITUDE DIFFERENCES IN ORDER OF MAGNITUDE
      DO 421 M=1,33
      BSQUAR(M)=ASQUAR(M)
  421 CONTINUE
  423 DO 424 I=1,32
      I1=I+1
      DO 424 N=I1,33
      IF(ASQUAR(I)-ASQUAR(N))425,424,424
  425 AUX=ASQUAR(N)
      ASQUAR(N)=ASQUAR(I)
      ASQUAR(I)=AUX
  424 CONTINUE
C     INDICATE THE ORDINAL NUMBERS OF THE POINTS MEASURED
C     IN ORDER OF DISCRIMINATING POWER
      DO 430 I=1,33
      IORDER=0
      DO 2 M=1,33
      IORDER=IORDER+1
      IF(ASQUAR(I)-BSQUAR(M))7,7,2
    7 NUM(I)=IORDER
      BSQUAR(M)=-9999999.0
      GO TO 430
    2 CONTINUE
  430 CONTINUE
      WRITE(3,14)(NUM(L),L=1,33)
   14 FORMAT('0','ORDINAL NUMBERS',T20,33I3)
      CONTINUE
      END
PROGRAM III
C     AUTOMATIC DISCRIMINATION OF N AND M
C     UNIVERSITY OF OULU, FINLAND
C     INSTITUTE OF PHONETICS
C
C     LOGIC OF THE PROGRAM:
C     1: CALCULATE THE MEANS OF THE AMPLITUDES AT THE NINE
C     MEASUREMENT POINTS, WHICH ARE THE MOST DISCRIMINATING
C     POINTS ON THE FREQUENCY AXIS FOR N AND M!
C     2: SET THE AMPLITUDES IN ORDER OF MAGNITUDE!
C     3: INDICATE THE ORDINAL NUMBERS OF THE AMPLITUDES!
C     4: FORM THE GENERAL NUMERICAL MODEL FOR N AND M
C     ON BASIS OF THE ORDINAL NUMBERS!
C     5: CALCULATE THE MODELS OF NEW NASAL SOUNDS WITH
C     THE SAME METHOD!
C     RESOLVE THE PROBLEM: IS THE NEW NASAL SOUND AN N OR
C     AN M? COMPARE ITS MODEL WITH THAT OF THE GENERAL
C     MODELS OF N AND M!
C
      DIMENSION ASUM(9),AMEAN(9),BSUM(9),BMEAN(9),NUM(9)
      DIMENSION AMPLIT(9),NUMBER(9),INUMBR(9)
C
C     COMPUTATION OF THE MEANS IN THE BASIC MATERIAL
C     CONSISTING OF A SET OF N AND M SOUNDS
      WRITE(3,222)
  222 FORMAT('1',' ')
      K=1
      GO TO 401
  400 K=K+1
      :   The principle of calculating the means
      :   is presented in the PROGRAM I.
C     SET THE AMPLITUDES IN ORDER OF MAGNITUDE
      INDIV=0
      GO TO 59
  770 K=-1
      :   The principle of calculating the order of
      :   magnitude is presented in the PROGRAM II.
C     FORM THE ORDINAL NUMBERS: FOR EXAMPLE: THE GREATEST
C     AMPLITUDE WAS THE NINTH IN ORDER
      DO 450 I=1,9
      IORDER=0
      DO 2 M=1,9
      IORDER=IORDER+1
      IF(AMEAN(I)-BMEAN(M))7,7,2
    7 NUM(I)=IORDER
      BMEAN(M)=-9999999.0
      GO TO 450
    2 CONTINUE
  450 CONTINUE
      WRITE(3,14)(NUM(L),L=1,9)
   14 FORMAT('0','ORDINAL NUMBERS',T20,9I6)
C     FORM THE NUMERICAL MODEL
      MODEL=0
      MULTPL=100000000
      DO 50 M=1,9
      IPROD=NUM(M)*MULTPL
      MODEL=MODEL+IPROD
      MULTPL=MULTPL/10
   50 CONTINUE
C     K=1: MODEL OF N, K=2: MODEL OF M, K=-1: NEW NASAL SOUND
      IF(K-1)49,51,52
   51 WRITE(3,31)MODEL
   31 FORMAT('0','MODEL OF N',T15,I10)
      IN=MODEL
C     THE SAME PROCEDURE CONCERNING M
      GO TO 400
   52 WRITE(3,66)MODEL
   66 FORMAT('0','MODEL OF M',T15,I10)
      IM=MODEL
      MEAN=(IN+IM)/2
      WRITE(3,111)MEAN
  111 FORMAT('0','THE MEAN OF N AND M',T25,I10)
C     FORM THE MODEL OF A NEW NASAL SOUND
  550 DO 330 M=1,9
  330 AMEAN(M)=0.0
      DO 334 M=1,9
  334 BMEAN(M)=0.0
      READ(1,9)(AMEAN(M),M=1,9)
    9 FORMAT(9F4.0)
      IF(AMEAN(1)-36.00000)660,888,888
  660 GO TO 770
   49 NAS=MODEL
      WRITE(3,77)MODEL
   77 FORMAT('0','MODEL OF NASAL',T18,I10)
C     CLASSIFICATION OF THE NEW NASAL SOUND
      INDIV=INDIV+1
      WRITE(3,98)INDIV
   98 FORMAT('0','INDIVIDUAL',T15,I3)
      IF(NAS-MEAN)801,802,803
  801 WRITE(3,900)
  900 FORMAT(' ','= M')
      GO TO 123
  802 WRITE(3,901)
  901 FORMAT(' ','= M OR N')
      GO TO 123
  803 WRITE(3,902)
  902 FORMAT(' ','= N')
  123 CONTINUE
      GO TO 550
  888 CONTINUE
      END
If the recognition of natural languages is not
possible, we should consider the possibility of an
artificial language which would be easy for a machine
to recognize.
If the social need for recognition automata
becomes very great, it is possible that the conservative
orthography of many languages will disappear and
phonemic orthography will become common.
The discriminant analysis used in this contribution
has been programmed by Mr. S. Sarna in the Computation
Centre of the University of Helsinki (cf. 2).

References

1. Mustonen, Seppo: Multiple Discriminant Analysis in
Linguistic Problems. Nordsam 64, Det Femte Nordiska
Symposiet över Användning av Matematikmaskiner.
Stockholm 18.-22.8.1964.

2. Sarna, Seppo: Erotteluanalyysin periaatteet ja
käyttömahdollisuudet. Mimeographed copy. Computation Centre
of the University of Helsinki (1968).

3. Cooley, W.W. and Lohnes: Multivariate Procedures for
the Behavioral Sciences. New York, John Wiley and Sons
(1962).

4. Tillmann, H.G.: Akustische Phonetik und linguistische
Akustik. Phonetica 16: 143-155 (1967).

5. Tillmann, H.G., Heike, G., Schnelle, H. and Ungeheuer, G.:
Dawid I - ein Beitrag zur automatischen "Spracherkennung".
5e congrès international d'acoustique. Liège 7-14 septembre
1965.
