SPEECH RECOGNITION SYSTEM FOR SPOKEN JAPANESE SENTENCES 
Minoru Shigenaga, Yoshihiro Sekiguchi and Chia-horng Lai 
Faculty of Engineering, Yamanashi University 
Takeda-4, Kofu 400, Japan 
Summary: A speech recognition system for continu- 
ously spoken Japanese simple sentences is de- 
scribed. The acoustic analyser based on a psy- 
chological assumption for phoneme identification 
can represent the speech sound by a phoneme 
string in an expanded sense which contains acous- 
tic features such as buzz and silence as well as 
ordinary phonemes. Each item of the word diction- 
ary is written in Roman letters of Hepburn sys- 
tem, and the reference phoneme string and the 
reference characteristic phoneme string necessa- 
ry for matching procedure of input phoneme se- 
quences are obtained from the word dictionary 
using a translating routine. In syntax analysis, 
inflexion of verbs and adjectives and those of 
some main auxiliary verbs are taken into account. 
The syntax analyser uses a network dealing with 
state transition among Parts of speech, predicts 
following words and outputs their syntactic in- 
terpretation of the input phoneme string. The 
semantic knowledge system deals with semantic 
definition of each verb, semantic nature of each 
word and the schema of the sentence, and con- 
constructs a semantic network. The semantic anal- 
yser examines semantic validity of the recogniz- 
ed sentence as to whether each word in the sen- 
tence meets the definition of the recognized 
verb or others. The present object of recogni- 
tion is a Japanese fairy tale composed of simple 
sentences alone. The syntactic and semantic anal- 
ysers work well and can recognize simple sen- 
tences provided that the acoustic analyser out- 
puts correct phoneme strings. For real speech, 
though the level of semantic processing is yet 
low, it can recognize 25 blocks out of 33 blocks 
(A block means a part of speech sound uttered in 
a breath.), and 9 sentences out of 16 sentences 
uttered by an adult male. 
1. Introduction 
Intensive studies of speech recognition or 
speech understanding are being carried out \[1-3\], 
but there are some fundamental problems to be 
solved both in acoustic analysis and linguistic 
processing. The authors think there must exist 
some fundamental procedures to be applicable to 
any task in speech recognition, and are trying 
to solve the problems through the behavior of 
two recognition systems which deal with Japanese 
sentences \[4\] and FORTRAN programs \[5\] spoken 
without interruption. 
Both the recognition systems consist of two 
parts: an acoustic analyser and a linguistic pro- 
cessor. In the acoustic analysis, recognition 
model based on a psychological assumption is in- 
troduced for phoneme identification. As a result, 
speech sound has come to easily be expressed in 
a phoneme string in an expanded sense that con- 
tains some acoustic features such as buzz and 
silence as well as ordinary phonemes. The sys- 
tems require a process of learning a small num- 
ber of training samples \[6\] for identification 
of the speaker's vowels, nasals and buzz. In the 
linguistic processor, using major acoustic fea- 
tures as well as linguistic information haSmadeit 
possible to effectively reduce the number of 
candidate words. For sequences of phonemes with 
erroneous ones has also been devised a graphic 
matching method \[7\] more suitable for matching 
than the one using dynamic programming. 
In the previous system for Japanese sen- 
tences, sentences were narrowly limited in a 
pre-decided style. In the new system, as shown 
in Fig. i.i, the knowledge system is much rein- 
forced. That is, in the syntax analysis, inflex- 
ion of verbs and adjectives and those of some 
main auxiliary verbs can be referred;.thus the 
syntax analyser may be able to deal with various 
kinds of simple sentences. A simulation has con- 
firmed the ability of syntax analyser for simple 
sentences which have been offered in terms of 
Roman letters without any partition between 
words. In the semantic knowledge source, seman- 
tic definition of verbs, natures of nouns, a 
simple schema for a topic are stored, and seman- 
tic network will be constructed as a recognition 
processgoes on. This semantic knowledge is used 
to yield, at the end of spoken sentence, the 
most semantically probable sentence as an output 
and occasionally to reduce the number of candi- 
date words in co-operation with the syntax ana- 
lyser. 
2. Acoustic Analyser and Matching Method 
A psychology based model is used to obtain 
neat phoneme string from speech wave using the 
following feature parameters determined every 
ten milli-seconds \[5\]. 
(i) Maximum value of amplitudes, 
(ii) Number of zero-crossing, 
(iii) Normalized prediction error, 
(iv) Pareor-coefficients, 
(v) Variation of Parcor-coefficients between 
successive frames, 
(vi) Frequency spectrum, 
(vii) Formant frequencies. 
The output phonemes and their decision methods 
are given in Table 2.1. The obtained output pho- 
neme strings contain 5 Japanese vowels, a nasal 
group, an unvoiced stop consonant group, /s/, 
/h/, /r/, buzz parts and silence. Discrimination 
of each stop consonant \[8\] and that of each na- 
sal consonant are not yet embodied in this sys- 
tem. 
Vowels and /s/ having long duration and si- 
lent parts are used as characteristic phonemes. 
-- 472 
Knowledge source 
Semantic knowledge 
~~ {Defin~tion~ \]ema  i/erb / 
<structure) (network b 
~Semantic 
information I 
I 
) Semantic analyser I 
Syntactic knowledge 
Syntactlc State k__/ L__dsyntactic 
transition~----~Inflexion) I ~information/ networkS__ | 
Knowledge about vocabulary 
, \] , Li_~v°cabulary ~_~ 
fCharacteristic~ /Word i hl "\information/ 
analyser \[ IMatching unit III 
~-~\] {Candidate ~ I \word strings/ 
Selector of ~Matching ~ I\] candidate words unit 
I (Phoneme ~string) 
I Acoustic ~ analyser\] 
Fig. 1.1 Speech recognition system. 
Besides an ordinary word dictionary, a character- 
istic phoneme dictionary (This dictionary exists 
only implicitly and is automatically composed 
from the word dictionary which is written in Ro- 
man letters.) is prepared and presents major a- 
coustic features of each word. These major fea- 
tures are used for reduction of the number of 
candidate words. 
For matching between a phoneme string with 
erroneous phonemes and items of the word or char- 
acteristic phoneme dictionaries, a new matching 
method using graph theory is devised \[7\]. 
These acoustic and matching processings are 
the same as the ones in the previous systems. 
3. Knowledge Representation 
3.1. Syntactic Knowledge 
3.1.1. Classification of Japanese words 
for machine reco@nition 
In order to automatically recognize 
continuously spoken natural languages, it 
is necessary to use syntactic rules. How- 
ever using the original form of Japanese 
grammar written by grammarians is not nec- 
essarily suitable for mechanical recogni- 
tion. Moreover it is very difficult to re- 
duce the number of predicted words only by 
syntactic information because of the nature 
of Japanese language which does not require 
to keep the word order so rigorously. Tak- 
ing account of these conditions, Japanese 
words are classified as described in the 
following article and the syntax may pref- 
erably be represented by state transition 
networks as shown in section 3.1.3. 
3.1.1.1. Classification of words by parts of 
speech 
Each word is classified grammatically as 
given in Table 3.1. In Japanese nouns, pronouns, 
numerals and quasi-nouns (KEISHIKI-MEISHI in Jap- 
anese) are called substantives (inflexionless 
parts of speech in Japanese grammar, TAIGEN in 
Japanese), and verbs, auxiliary verbs and adjec- 
tives are called inflexional words (inflexional 
parts of speech! YOGEN in Japanese). Meanwhile 
the words No. 1 - No. ii in Table 3.1 are inflex- 
ionless words and the words No. 12 - No. 15 are 
'able 2.1 Output phonemes and their decision methods. 
Class Output Phoneme Decision Method 
Vowel i,e,a,o,u 
Parcor-coefficients k, 
Nasal m'n'9'N using Bayes decision theory 
Buzz denoted by B 
s Number of zero-crossings 
Fricative Variations of amplitude and 
h spectrum, Number of zero- 
crossings, and Unsimilarity 
to vowels and nasals 
r Liquid 
Unvoiced 
stop 
Silence 
p,t,k 
Variations of amplitude and 
first formant frequency, 
Number of zero-crossings 
Following after silence and 
Having high frequency 
components 
Small amplitude 
--473-- 
inflexional words. In No. 16 the inflexion rules 
necessary for each inflexional word are written 
in appropriate forms. The additional word "car- 
riage return" in No. 17 is a special symbol. We 
ask each spejker to utter the word "carriage re- 
turn" at the end of each sentence in order to in- 
form the recognizer of the end of a sentence. 
Japanese verbs, adjectives and auxiliary 
verbs are inflexional. The verb's inflexion has 
been classified traditionally into 5 kinds of in- 
flexion types: GODAN-KATSUYO (inflexion), KAMI- 
ITCHIDAN-KATSUYO, SHIMO-ICHIDAN-KATSUYO, SAGYO- 
HENKAKU-KATSUYO and KAGYO-HENKAKU-KATSUYO. But 
we classify them into 14 types as given in Table 
3.2 taking into account the combination of the 
stem, a consonant following the stem and the in- 
flexional ending of each word. Examples are 
shown in Fig. 3.1. By so doing the number of in- 
flexion tables becomes smaller. 
The adjectives and verbal-adjectives(KEIYO- 
DOSHI in Japanese) have we classified into 3 
types according to their inflexion. Two types of 
them are shown in Fig. 3.2. 
The inflexion of auxiliary verbs is the 
same as the traditional one. Some examples are 
Table 3.1 Classification of words by parts 
of speech. No.16 and 17 are exceptional. 
No. part of speech 
1 
2 
3 
4 
5 
6 
7 
8 
9 
l0 
ii 
12 
13 
14 
15 
16" 
17" 
noun 
pronoun 
nmneral 
quasi-noun 
prefix 
suffix 
part modifying substantives 
adverb 
conjunction 
exclamation 
particle 
verb 
adjective 
auxiliary verb 
subsidiary verb 
inflexion 
carriage return 
Table 3.2 Classification of verbs. 
No Inflexion Example 
1 
2 
3 
4 
5 
6 
7 
8 
9 
i0 
ii 
12 
13 
14 
GODAN-KATS UYO 1 
,, 2 
,, 3 
" 4 
,, 5 
" 6 
,, 7 
" 8 
,, 9 
,, i0 
KAMI - I CH I DAN- KATS UYO, 
SHI MO- I CHI DAN-KATSUYO 
SAGYO-HENKAKU- KATSUYO 
KAGY O- HENKAKU -KATS UYO 
Verb: ARU (be) 
IKU 
KATS U 
NORU 
KAU 
SHINU 
YOMU 
YOBU 
SAKU 
OSU 
OYOGU 
OKI RU 
NAGE RU 
SURU 
KURU 
ARU 
shown in Fig. 3.3. 
YOMU 
Word Stem 
IKU I ~ 
(go) 
Inflexion Following vowel 
First I Ending consonant vowel Consonant & vowel 
(a) -- K ~ A (i. negative) 
I (2. RENYO) 
U (3. conclusive) 
U (4 RENTAI) 
E (5. conditional) 
E (6. imperative) 
OU(7; volitional) 
(b) T ~ TA (3) (auxiliary) 
TA (4) verb 
TE (particle) 
(read) 
Fig. 3.\] 
Inflexion 
Adjective I 
Adjective II (Verbal- 
adjective ) 
YO ~ M -- the same as (a) \ 
N (c) ~ DA (3) (auxiliary) 
DA (4) verb 
DE (particle) 
Inflexion of verbs: IKU (go) (No.l in Ta- 
ble 3.2) and YOMU (read) (No.6 in Table 
3.2). RENYO or RENTAI means that the 
following word must be inflexional or 
substantive respectively. The following 
words TA and DA are auxiliary verbs and 
TE and DE are particles. 
Word I Stem I Inflexion 
UTSUKUSHII UTSUKUSHI 
(beutiful) 
SHIZUKADA 
(being 
quiet ) 
Fig. 3.2 
SHIZUKA 
~!I (3) 
(4) 
u (2) 
EREBA (5) 
AROU ( 7 ) 
T -- TA (3) 
T -- TA (4) 
DA (3) 
NA (4) 
ARA ' 5 
\~DA~ou (7) 
~DAT -- TA (3) 
--DAT -- TA (4) 
Examples of inflexion of an adjective and a 
verbal-adjective. The numbers in parentheses 
are identified with the ones in Fig. 3.1. 
Word Stem Inflexion 
(2) 
(3) 
(4) 
(5) 
(7) 
Fig. 3.3 
Word Stem Inflexion 
NAI NA~iU (4) (3) (2) 
\k'KERE--BA c5) 
~KAT---TA (3) 
'KAT---TA (4) 
Examples of inflexion of auxiliary verbs. The 
numbers in parentheses are identified with 
the ones in Fig. 3.1. 
-474 
3.1.1.2. Classification of words by syntactic 
functions 
In a Japanese sentence some words express 
material (no~ma) such as substantives and verbs, 
and the others express syntactic function (no~- 
sis) such as particles and auxiliary verbs \[9\]. 
The latter controls the syntactic function of 
the former, or; in other words, gives a material 
word or phrase a modifying function and these 
two words usually appear in a pair in sentences. 
The pair is called a phrase, and some modifying 
relation is established between phrases. And 
those modifying relations between phrases com- 
pose a sentence. In some cases a phrase consists 
of only a word such as an adjective, an adverb 
and some inflexional word, without being accom- 
panied by any word that expresses a syntactic 
function, and itself carries a syntactic func- 
tion. Some examples are shown here. 
(i)WATASHI (pronoun) NO (particle) HON ~noun 
" I " "book" 
..................... phrasel l 
modifying relation 
(my books) 
adjective) H~A (noun ) SHIROI ...... ( white flower 
phrase \[ 
modifying relation 
(white flowers) 
ISHI noun ) NO (particle) IE (noun) 
(stone l house 
phrase 
modifying relation 
(stone houses) 
HON.noun. KONO(7 in Table 3.1) GA(particle).. 
ph}ase this ~ (book) 
I modifying relation 
................................. 
phrase (This book ...) 
(ii) TOKYO (TOKYO)n°un . E (particle,to I~U (verbgo) 
I phrase I modifying relation 
(go to TOKYO) 
HON (noun. UO (partlcle) KAU (verb) 
book) l buy 
phrase 
I modifying relation 
(buy a book) 
The syntactic relation is classified into 
three categories: 
(a) Modification of a substantive word or 
phrase 
Some examples are shown in above (i). 
(b) Modification of an inflexional word or 
phrase 
Some examples are shown in above (ii). 
(c) Termination (the end of a sentence). 
3.1.3. Szntactic state transition network 
A syntactic state transition network is a 
network which represents the Japanese syntax\[10\]. 
The standard form is shown in Fig. 3.4, where 
each S represents a syntactic state, an arrow 
a transition path to the next state, C a part of 
speech, and I syntactic information. Therefore, 
if a state S O is followed by the part of speech 
C O then the state transits context-freely to S 1 
outputting syntactic information I 0. 
To an inflexional word a transition network 
is also applied and represents the inflexion. In 
speech recognition it is necessary to pursue the 
whole transition from the stem of an inflexional 
word to the end of inflexion, in other words, to 
predict the stem of an inflexional word with its 
inflexional ending and to output the syntactic 
information comprehensively for the whole words 
including their inflexions. In Fig. 3.5 is shown 
an example of transition network and accompany- 
ing syntactic information for two verbs "IKU(go)" 
Fig. 3.4 
c0/I ° 
Standard form of syntactic state 
transition network. SO, Sl: states, 
CO: part of speech or inflection, 
I0: syntactic information. 
re 
re 
re 
Fig. 3.5 Transition network for verbs: "IKU (go) 
and YOMU (read)" with their inflexion 
and syntactic information. X/Z means 
that X is output letters and Z is the 
syntactic information. ~: empty, CR: 
carriage return, P: particle, and the 
numbers are identified with the ones 
in Fig. 3.1. 
--475-- 
and "YOMU (read)". This procedure corresponds to 
predicting all possible combinations of a verb 
with auxiliary verbs. For example, for a word 
"go", it may be better to predict probable com- 
binations: go, goes, will go, will have gone, 
went and so on, though the number of probable 
combinations will be restricted. 
The syntactic state transition network can 
not only predicts combinable words but also out- 
puts syntactic information about modifying rela- 
tion between phrases. 
3.2. Knowledge about Vocabulary 
3.2.1. Word dictionary 
Each word is entered in a word dictionary 
in group according to part of speech as shown in 
Fig. 3.6. Each entry and its inflexion table are 
represented in Roman letters together with seman- 
tic information. If a part of speech is predict- 
ed using the syntactic state transition network, 
a word group of the predicted part of speech is 
picked out from the dictionary. 
3.2.2. Automatic translating routine for Roman 
letter strings and inflexion tables 
This routine translates a word written in 
Roman letters into a phoneme string using a 
table \[ii\]. A translated phoneme string of a pre- 
dicted word is used as a reference for matching 
an input phoneme string. This routine can also 
extract the characteristic phoneme string of a 
word. A characteristic phoneme string of a word 
contains only phonemes to be surely extracted 
from the speech wave. It is composed of vowels, 
/s/ and silence, and represents major acoustic 
information of a word. Some examples of the pho- 
neme strings are shown in Table 3.3. 
For matching procedure between an input pho- 
neme string and a predicted word are used both 
phoneme and characteristic phoneme strings of 
the word. Here, these phoneme strings are not 
stored in the word dictionary. The system has 
only one word dictionary written in Roman let- 
ters and phoneme stringsnecessary for matching 
are produced each time from the word dictionary 
using the translating routine. This fact makes 
it very easy to enrich the entry of vocabulary. 
part of Word 
speech 
C0~ WOO 1 
W002 
CI------~ WI01 
WI02 
C2-----~ W201 
W202 
Fig. 3.6 Word dictionary. 
Table 3.3 Examples of phoneme and characteris- 
tic phoneme strings of words. P: un- 
voiced stop, N: nasal, B: buzz, .: 
silence. 
Word Phoneme Characteristic 
(Pronunciation) string phoneme string 
OZIISAN OBSIISAN OISA 
YAMA IEAMA AA 
SENTAKU SEN.PA.PU SE.A.U 
OOKII OO.PSI O.SI 
3.3. Semantic Knowledge 
Semantic information is used for the follow- 
ing purposes. 
(i) Elimination of semantically inconsistent 
sentences which have been recognized using only 
acoustic and syntactic information. 
(ii) Future development to semantic understand- 
ing of natural language by forming semantic net- 
works. 
(iii) Control of transition on the syntactic 
state transition network through the syntax ana- 
lyser. 
3.3.1. Semantic information 
One of the semantic information dealt with 
is "knowledge about meaning". This knowledge in- 
volves (i) what each word means, (ii) verb-cen- 
tered semantic structure, and (iii) schema of a 
story \[i0\]. The other information is, so called, 
"remembrance of episode" which means the remem- 
brance of a topic of conversation. In the pre- 
sent system, meaning of a word is represented by 
a list structure, and the others are represented 
by networks. 
In the system the knowledge about meaning 
must be given from outside and can not yet be in- 
creased or updated by itself, but remembrance of 
episode can be increased or updated whenever new 
information comes in. While, if a schema has 
been already formed for a topic to be talked 
from now on, the knowledge of the topic will 
help recognition of the spoken topic. In the fol- 
lowing sections how semantic information works 
in the recognition system will be explained. 
3.3.1.1. Meaning of a word 
Denote a word by n, its characteristic fea- 
tures by fi(i=l,...,m; m is the number of fea- 
tures). Then, the meaning of a word may be ex- 
pressed as follows: 
n(fl' f2' "''' fm )' 
where 
f. = 1 when the word has the characteristic 1 
feature f, l 
f = 0 when the word has not the feature f . 1 1 
For example, if fl = concrete, f2 = creature, f3 = 
animal, .... then 
hill (1, 0, 0, ..... ), dog (i, i, i, ..... ). 
476 .... 
3.3.1.2. Definition of a verb 
A verb plays very important semantic role 
in a simple sentence. A semantic representation 
of meaning of a verb is shown in Fig. 3.7, where 
n O , n I , ..., n. are nodes, and Ar I, Ar 2, .., Ar. l 1 
attatched to each arc are the natures of each 
arc. The nature of a node n is determined by a P 
nature Ar attatched to the arc directing to the 
P 
node n . Thus, P 
Structure = (V, Arl, Ar 2, ..., Ari), 
in I = a word or node qualified by 
a nature Arl, 
Restriction " 
n. a word or node qualified by l 
a nature Ar. 1. 
For example, a verb "IKU (go)" is defined by 
Fig. 3.8. 
3.3.1.3. Schema 
The form of a schema can not be determined 
uniquely. Dealing with a story, we may be able 
to represent the schema, for example, as shown 
in Table 3.4 and Table 3.5. 
3.3.1.4. Remembrance of an episode --- Forma- 
tion of a semantic network 
Refering to the results of syntactic analy- 
sis and the relation between the nature of an 
arc and a case particle (partly involving anoth- 
er particle), the system forms a semantic net- 
work for a simple sentence centering a recog- 
nized verb. For instance, if a word sequence 
OZIISAN WA YAMA E SHIBAKARI NI IKIMASHITA. 
(An old man went to a hill for gathering) 
firewoods. 
with syntactic information is given, a network 
shown in Fig. 3.9 will be formed. In Fig. 3.9 a 
process constructing a sentence is also shown. 
3.3.2. Linking a semantic network for a sen- 
tence with a semantic network for an 
episode 
After a network for a sentence has been 
formed, the network must be linked Up with the 
already constructed network for the current epi- 
sode. For this purpose a new node must be identi- 
fied with the same node in the episode network. 
< nl> 
Fig. 3.7 
verb \] 
isa , 
Ar I 
< no> , 
< n 2 > < n 3 > 
Definition of a verb. 
n: node, Ar: nature of an arc, 
isa: is an instance of. 
IKU (go) 
isa ~.~ <n5> 
<nl>~<n0~-~ at T 
~ n2) ~'~r°m L lto L~~.-~< n4 > 
<n3> 
Fig. 3.8 Definition of a verb "IKU (go)" 
sub: subject, L: location, T: 
time, isa: is an instance of, 
ino: in order to. 
Table 3.4 A schema of a story. 
Story Title 
Scenes Opening scene Episode 
m, n, o j k, i, m 
event i event 2 
Characters A, B, C, D A, B, E, F A, C 
Other key words m, n 
event n 
X, Y, Z 
a, b, c 
Table 3.5 A schema for a tale "MOMOTARO (a brave boy born out of a peach)". 
Story MOMOTARO 
Scenes Opening 
scene event 1 
an old man Characters an old man 
an old woman 
Other key words once upon a time, live hill, fire- woods, go 
Episode 
event 5 
Momotaro, dog, 
monkey, pheasant 
treasure, bring 
--477-- 
Word sequence recognized using acoustic and syn- 
tactic information: 
OZIISAN WA YAMA E SHIBAKARI NI IKIMASHITA. 
(an old) to a for gathering) (went) 
man (hill) (firewoods 
Forming phrases and giving syntactic informa- 
tion: 
OZIISAN WA YAMA E SHIBAKARI NI IKIMASHITA. ...................................... 
\[phrase having\] \[RENYO\] \[RENYO\] \[RENYO\] verb, end 
Constructing a sentence by showing modifying re- 
lation: 
OZIISAN WA YAMA E SHIBAKARI NI IKIMASHITA. 
(modification) 
(a) Process of constructing a sentence. 
an old man go 
~isa ~:sa <i05> i-~ gathering 
~ino firewoods 
<lh> ~-~- <ioo> ...__._t~ 
from / Ito L "~<i04> 
<102> <103> isa ~ hill 
(b) Semantic network. 
Fig. 3.9 Process of constructing a sentence (a) 
and its semantic network (b) for "An 
old man went to a hill for gathering 
firewoods.". --- shows a phrase, 
shows modification and RENYO in \[ \] 
means this phrase modifies an inflex- 
ional word or phrase, ino: in order to. 
In the present system all relations explicitly 
appearing in sentences and nodes expressing lo- 
cation are examined whether they have already ap- 
peared or not. Time relation is not handled un- 
less it appears explicitly in sentences. Deeper 
structures of meaning such as causality or rea- 
soning are not yet able to be dealt with. Fig. 3. 
i0 illustrates a network for the episode, which 
has been constructed after the system has proces- 
sed several sentences at the beginning of the 
tale of "MOMOTARO" shown below. 
There lived an old man and an old woman. 
The old man went to a hill for gathering fire- 
woods. 
The old woman went to a brook for washing. 
She was washing on a brookside. 
3.3.3. Word prediction by a conjunction "TO 
(and)" 
When the syntax analyser has found a con- 
junction "TO (and)" which is used to enumerate 
some nouns, the system can predict a following 
noun group. For instance, for the input "MOMOTA- 
RO WA INU TO ... (MOMOTARO was accompanied by a 
dog and ... ", the system picks up as a follow- 
ing noun a noun group having similar natures to 
those a dog has. 
3.3.4. Application of semantic knowledge to 
speech recognition 
Using semantic knowledge the system ad- 
vances recognition process as follows: 
(i) Using acoustic and syntactic information, 
and sometimes semantic information, the system 
processes an input sentence and outputs several 
word sequences. The syntax analyser gives to 
each word sequence necessary syntactic informa- 
tion such as part of speech of each component 
word, phrase and modifying relation between 
an old man 
< > l' ~sub 
~ and Tisa 
/ and from / ~ f~ ~ ~from } T / \tot / IL / . 
<i007 >~ ~ ~ / 1 / ino / i015" isa 
isa ub L rOm <lO10< at T ><1014 > 
/an o~ld woman~ ~ Lisa/// \]to L ..... 
| at T /<1020> isa > go <" <i~13\ isa • hill 
<1024>~ 1~10#25>~ ~'~<1023> isa > brook 
washing ~ <1030> isa > do 
/ f r om~~ T 
sub ~T "~<I034> 
1033 
Fig. 3.10 A network for the episode constructed after processing the several 
sentences at the beginning of the tale of "MOMOTARO". 
gathering 
firewoods 
--478 - 
phrases. 
(ii) The semantic processor, using this syntac- 
tic information, forms a semantic network for 
each word sequence. 
(iii) A word sequence for which a semantic net- 
work failed to be formed satisfactorily is re- 
jected because of semantic inconsistency. For in- 
stance, for an input sentence: "OZIISAN WA YAMA 
E SHIBAKARI NI IKIMASHITA.(An old man went to a 
hill for gathering firewo6ds.)", an output word 
sequence: "OZIISAN WA HANA (flower) E SHIBAKARI 
NI IKIMASHITA." is rejected, because the verb 
"IKU (go)" has an arc "to Location" but the out- 
put word sequence has no word meaning location 
and also the word "HANA (flower)" has no appro- 
priate arc in the network. 
(iv) Taking into account the result of syntax 
analysis and reliability of acoustic matching, 
the most reliable word sequence is output. 
(v) Finally, the semantic network of the out- 
put sentence is linked with the semantic network 
of the episode formed by this process stage. 
4. Results 
We have been dealing with a Japanese fairy 
tale, "MOMOTARO" consisting of simple sentences 
and are now improving the system performance. 
The system's vocabulary is 99 words in total ex- 
cepting inflexion of verbs, auxiliary verbs and 
adjectives. For simple sentences, the syntactic 
and semantic analysers work well. Furthermore 
the syntactic analyser alone can exactly recog- 
nize simple sentences with correct phoneme 
strings which would be provided from an ideal a- 
coustic analyser. Though the level of semantic 
analysis is in its first stage, for simple sen- 
tences the semantic analyser can reject semanti- 
cally inconsistent word sequences. 
Therefore the acoustic analyser must be im- 
proved first of all. Its performance is as fol- 
lows: The total number of output phonemes expect- 
ed for an ideal acoustic analyser is 826 for the 
whole 16 test sentences from the tale, while the 
number of correct phonemes obtained from the an- 
alyser is 741 (89.7 %), and that of erroneous 
phonemes is 125 (15.1%), in which the numbers 
of mis-identified phonemes, missing phonemes and 
superfluous phonemes are 25, 60 and 40 respec- 
tively. 
The system can successfully recognize 25 
blocks (a part of a sentence uttered in a breath) 
out of 33 blocks, and 9 sentences out of 16 sen- 
tences. 
5. Conclusion 
We have just started to construct a speech 
recognition system which can deal with semantic 
information and inflexion of words and have many 
problems to be solved. IIowever, from this experi- 
ment it may be able to say as follows: 
(i) The acoustic analyser gives Pretty neat pho- 
neme strings, if only a learning process using 
Bayes decision theory for a group of vowels, na- 
sals and buzz is executed for each speaker. 
ii) Use of global acoustic features is effec- 
tive to reduce the number of predicted candidate 
words, though its effectiveness is not so much 
as in case of our isolatedly spoken word recogni- 
tion system \[12\]. 
(iii) In Japanese, inflexion of inflexional 
words are complicated, and the number of Roman 
letters involved in the stem and inflexional end- 
ing of each verb or each auxiliary verb is usual- 
ly very small. Especially the number of letters 
which very important particles have is much 
smaller. These aspects are very unfavorable for 
speech recognition in which ideal acoustic pro- 
cessing can not be expected. But the syntactic 
and matching processors can, to some extent, pro- 
cess input phoneme strings with erroneous pho- 
nemes satisfactorily. 
(iv) Developing the vocabulary is very easy. 
Of course we must improve the capability of 
the syntactic and semantic analysers and also de- 
velop the vocabulary. 
References 

i. Reddy, D.R.: "Speech recognition", Invited 
papers presented at the 1974 IEEE symposium, 
Academic Press (1975). 

2. Sakai, T. and Nakagawa, S.: "A speech under- 
standing system of simple Japanese sentences 
in a task domain", Trans. IECE Japan, Vol. 
E60, No. i, p.13 (1977). 

3. Koda, M., Nakatsu, R., Shikano, K. and Itoh, 
K.: "On line question answering system by con- 
versational speech", J. Acoust. Soc. Japan, 
Vol. 34, No. 3, p.194 (1978). 

4. Sekiguchi, Y. and Shigenaga, M.: "Speech re- 
cognition system for Japanese Sentences", J. 
Acoust. Soc. Japan, Vol. 34, No. 3, p.204 
(1978). 

5. Shigenaga, M. and Sekiguchi, Y.: "Speech re- 
cognition of connectedly spoken FORTRAN pro- 
grams", Trans. IECE Japan, Vol. E62, No. 7, 
p.466 (1979). 

6. Sekiguchi, Y. and Shigenaga, M.: "A method of 
phoneme identification among vowels and na- 
sals using small training samples", Acous. 
Soc. Japan Tech. Rep., $78-17 (1978). 

7. Sekiguchi, Y. and Shigenaga, M.: "A method of 
classification of symbol strings with some 
errors by using graph theory and its applica- 
tion to speech recognition", Information pro- 
cessing, Vol. 19, No. 9, p.831 (1978). 

8. Shigenaga, M. and Sekiguchi, Y.: "Recognition 
of stop consonants", 10th ICA, (1980). 

9. Suzuki, K.:"NIPPON BUNPO HONSHITSURON"(Funda- 
mental study on Japanese grammar), Meiji-sho- 
in (1976). 

I0. Norman, D.A. and Rumelhart, D.E. : "Explora- 
tions in cognition", W.H. Freeman and Company. 
(1975). 

ii. Sekiguchi, Y. and Shigenaga, M.: "On the 
word dictionary in speech recognition system", 
Reports of Faculty of Eng., Yamanashi Univ., 
No. 28, p.122 (1977). 

12. Sekiguchi, Y., Oowa, H., Aoki, K. and Shige- 
naga, M.: "Speech recognition system for FORT- 
RAN programs", Information Processing, Vol.18 
No. 5, p.445 (1977). 
