A TRIAL OF JAPANESE TEXT INPUT SYSTEM 
USING SPEECH RECOGNITION 
K.Shirai Y.Fukazawa T.Matzui II.Matzuura 
Department of Electrical Engineering 
Waseda University 
3-4-10kubo Shinjuku-ku 
Tokyo, 160 Japan 
Summary 
Since written Japanese texts are expressed by 
many kinds of characters, input technique is 
most difficult when Japanese information is proc- 
essed by computers. Therefore a task-independent 
Japanese text input system which has a speech 
analyzer as a main input device and a keyboard 
as an auxiliary device was designed and it has 
been implemented. The outline and experience of 
this system is described in this paper. 
The system consists of the phoneme discrimina- 
tion part and the word discrimination part. 
Acoustic analysis and phonemic discrimination 
are effectively performed using an estimation 
method of the articulatory motion from speech 
waves in the phoneme discrimination part. And its 
outputs are lattices of Japanese pseudo-phonemes 
corresponding to speech inputs. 
In the word discrimination part, phonemic strings 
are corrected using various kinds of a priori 
knowledge anu transformed into suitable kinds of 
characters. On behalf of it, this part mainly 
consists of the refrieval of the word dictionary 
indexed by vowels, and the similarity calculation 
using dynamic programming. 
i. Introduction 
Speech recognition systems can be classified 
according to whether they recognize isolated or 
connected words. But in either case, word recog- 
nition has been thought a basic problem. In the 
speech understanding system, a method of utiliz- 
ing various kinds of a priori knowledge, which 
is obtained from a phonemic, a syntactic and 
a semantic level on clearly defined tasks, has 
been developed. And it has been clarified how 
information at each level contributes to recog- 
nition. I-~ 
While many understanding systems set the goal 
at the understanding of contents of speech, our 
nain object is the task-independent Japanese 
texts input system. Because a system which uti- 
lizes properties of a task too much is not so 
practical and the phonemic discrimination must 
be the core of the speech recognition system. 
This system consists of the phoneme discrimina~ 
tion part and the word discrimination part. 
In the phoneme discrimination part, connected 
spoken words is segmented into discrete phonemic 
strings. The main difficulties at this step are 
the large variation of each utterance and the 
coarticulation, and the difference between 
talkers. So it is important to select acoustic 
parameters which aren't relatively affected by 
these variations. 
This system employs the feature extracton 
method based on the speech production model. The 
articulatory motion for the utterance is estimat- 
ed from speech waves, and the phonemic discrimi- 
nation is performed using the trace of them. The 
followings are the merits to describe the acous- 
tic feature at an articulatory level: 
(i) Processing of coarticulation may be consid- 
ered most clearly in relation to the physio- 
logical factor of speech production. 
(2) Adaptation to a speaker is easier than other 
methods. 
Nevertheless it is almost impossible to const- 
ruct perfect phonemic strings corresponding to 
spoken words only by the phonemic discrimination. 
So in the word discrinimation part, the phonemic 
strings are corrected using various kinds of 
a priori information. At the same time, it is 
neccessary to transform the phonemic strings into 
the suitable kinds of characters in order to print 
the texts. Because written Japanese texts are 
expressed by many kinds of characters, which are 
Kanji (ideo-graphic character), Hira-gana (cursive 
form of Kana, Japanese syllabary), Kata-kana 
(square form of Kana) and special symbols. 
The confusion matrix between phonemes, the 
phonemic connective rules, the word dictionary 
and so on have been already used as a priori 
information in word discrimination systems, 
and their validity has been also ascertainedl '~°6 
Therefore, theword discrimination part mainly 
consists of the retrieval of the word dictionary 
which is indexed by vowels, and the similarity 
calculation using dynamic programming and the 
phonemic confusion matrix. Because the phoneme 
discrimination part can recognize vowels more 
correctly than consonants and the phonemic 
confusion matrix reflects the characteristics of 
the phoneme discrimination part exceedingly. 
In order to translate from Kana to Kanji, some ,~ 
systems with the word dictionary for Kana-Kanji 
translation and with syntax and semantics 
analyzer, are suggested. But the task domain 
should be restricted so as to get the high trans- 
lation rate. 
Main difficulties of Kana-kanji translation 
are as follows: 
(i) Segmentation from input character s'trings 
to words. 
(2) Identification of synonyms 
In this system, identifiers inserted from a sim- 
ple keyboard serve as spaces between words. And 
the difficulty of the latter is solved by redun- 
dant inputs since speech input is much easier 
than other input techniques. 
m 
--464-- 
micropho~ne phoneme recognition 
keyboard 
syntactic ~_ 
approach l l \] 
~wo~a ~efe~en~e~___~ 
matching I KANJI printer 
I 
Fig 1-1 Block diagraB of the system 
The paper is structured as follows. The next 
section gives a feature extraction method using 
the articulatory model. And then the segmentation 
of the continuous speech, and the discrimination 
technique of vowels and consonants are described. 
Section 3 gives an outline of the word discrimi- 
nation method which uses the word dictionary 
• constructed systematically and employs some 
priori knowledge. Section 4 concludes with a 
brief description of the results obtained with 
the system thus far. 
.2. Phoneme discrimination.• 
2.1 Segmentation 
In the recognition of continuous speech, it is 
effective to discriminate voiced, unvoiced and 
silence. Next four parameters are used for 
voiced-unvoiced-silence decision. 14 
1) Power of signal. E:lO.log\[1 N1s2(n)\]Nn ~- (2-1) 
where N is the number of samples in one frame. 
0 , SO 100 250 h i" "" "" i"'h"d"" ""h"'h"*h"'h"d""h"'h"ih"'h"'hmh"' '''d'"'hmh"i ""h"d"" "" ii"h"'hmh"' "' 
...... 1' ' \[frame\] 
a w a 
"*'°" f--q I 
silence 
E, 
Fig. 2-1 Example of analysis /mikawasima/ 
2) The ratio of the power in low frequency and 
the total power of the signal. 
Power in low frequency = N Er= 
Total Power n~ (n) /n-~ 2(n) 
3) Zero-crossing count in one frame. (2-2) 
4) Normalized autocorrelation coefficient at 
unit delay, Pl as 
N //' N N-1 i2-3) 
Pl=n\[lS(n)s(n-1)//(n~lS2(n))( Z s2(n)) n=0 
These parameters are computed for each frame 
of samples and the example of analysis shown in 
Fig.2-1. The decision is performed using the 
next quadratic discrimination function. 
d i(x)=(x-~i)tc i-l(x-~i )+Inl cil (2-4) 
After this decision is made for each frame, 
smoothing algoritl~ is used . By this smoothing 
algorithm, noises and disturbances in the VUS- 
decision can be modified. The result of this 
decision is shown in Fig.2-1. 
2.2 Articulatory model 
Several authors have studied the estimation of 
the vocal tract shape from the speech wave. If 
the artieulatory motion is successfully estimat- 
ed, the result may be a good feature of the 
speech and effective for the recognition. 7-9 In 
this paper, a new technique is used to estimate 
the vocal tract shape, which depends on the 
precise analysis of the real data of the articu- 
latory mechanism. 10-13 
In this section, the articulatory model which 
constitutes the base of this method is intro- 
duced. This model relies on the statistical 
analysis of the real data and makes it possible 
to represent the position and the shape of each 
articulatory organ very precisely using the 
small number of parameters. And further, the 
physiological and phonologicalconstraints are 
included automatically in the model. 
The total configuration of the model is shown 
in Fig.2-2. The jaw is assumed to rotate with 
the fixed radius FJ about the fixed point F, and 
the location J is given by the jaw angle Xj. The 
lip shape is specified by the lip opening height 
L h, the width L w and the protruction Lp relating 
to the jaw position on the midsagittal plane. 
The tongue contour is described in terms of a 
semi-polar coordinate system fixed to the lower 
Velum p A L A T E   _5 ter 
.. 
F(O,O) p J T °ngu' 
0oso.iption of orticu ato.y 
G --G'(ag,bg) 
Glottis 
--465-- 
jaw and represented as a 13 dimensional vector 
r=(rl ..... rla). The variability of the usual 
articulatory process, there may be some limita- 
tions on account of physiological and phonologi- 
cal constraints, which can be expressed by the 
strong correlation of the position of each 
segment along the tongue contour. Therefore, it 
becomes possible to represent the tongue shape 
effectively using only a few parameters. These 
parameters can be extracted from the statistical 
analysis of X-ray data. 
A principal component analysis is applied and 
the tongue contour vector for vowels r v may be 
expressed in a linear form as, 
r v = li=l XTiVi + rL (2-5) 
where vi's (i=l,2 ..... p) are eigenvectors, and 
r v is a mean vector for vowels which corre- 
sponds roughly to the neutral tongue contour. 
The eigenvectors are calculated from the next 
equation C v. = ~.v. (2-6) V 1 
1 1 
where C v is the covariance matrix which is de- 
fined by N I _ - _ ? )T}(2-7) 
C v = ~ ~=l{(rvk rv) (rvl~ ,, 
and Xi is the corresponding eigenvalue to sat- 
isfy the characteristic equation 
Icv - ~II = o ~2-~ 
The same stastical technique as described 
above may be used for the expression of lip 
shape. ( k -- L h 1Js + L b 
I'w = \[ k2'Jl~3+ ~'; 
Lp \]~ 
where Os signifies the distanc, 
19 
between the up- 
per and the lower incisors. The active movement 
of the lips is reflected in the second term, and 
X L is the lip parameter. 
The total model with the nasal cavity is shown 
in Fig.2-3 and the characteristics of the artic- 
ulatory parameters are summarized in Table 2-1. 
Details about the model are found in the refer- 
ences. 
2.3 Estimation of articulatory motion 
In this section, the estimation of 'the articu- 
F. \] 
Let an m-dimensional vector y be the acoustic 
feature to represent the vocal tract transfer 
function of the model. In this study the ceps- 
trum coefficients are adopted as the acoustic 
parameters. The acoustic parameters are express- 
ed asa nonlinear function h(x) of the articula- 
tory parameters which means the transformation 
from x to y through the speech production model. 
On the other hand, let sY be acoustic parameters 
measured from the speech wave after glottal and 
radiation characteristics are removed. Then, the 
estimate ~ of the articulatory parameters is ob- 
tained as to minimize the next cost function. 
2 (2-10) J(=k 3=11 sYk -b(Xk)lf~+ II ×k I1~* II % -~k-lllr 
where R,Q and F are the matrices of weight, k is 
frame number and ~k-1 is the estimate at the 
previous frame. In the cost function, the 1st 
term is the weighted square errors between the 
acoustic parameters of the model and the measur- 
ed values. The 2nd and 3rd terms are used to re- 
strict the deviation from the neutral condition 
(x=0) and from the estimate of the previous 
frame, respectively. These terms are also effi ~ 
cient to reduce the compensation of the articu- 
latory parameters. 
rils 
latory parameters from the speech wave is con: 
sidered in the framework of the speech produc- Glottis 
tion model. This problem becomes nonlinear opti- XG 
mization of parameters under a certain criterion, -(---9+ 
and it must be solved by an iterative procedure. 
Therefore, the uniquness of the solution and the 
stability of the convergence are significant 
problems. In this method, these problems are 
solved by introducing constraints to confine the 
articulatory parameters within a restricted re- 
gion and by the selection of the appropriate in- + 
itial estimate using the continuity property. 
And also the model adjustment to the speaker 
brings good effects for the estimation. 
and 
latory 
Table 2-1 Articulatery parameters 
Articulatory 
parameters 
Tongue Tonguc J:tw IAp 
XTI ×T2 Xj X L 
Glottis Vehlm 
X G X N 
back high open round open open 
front low clese spread close ¢lose 
--466- 
If the method is applied to a speaker, first 
of all, the mean length of the vocal tract 
should be adjusted for the speaker. The length 
factor SFK which gives a uniform change in the 
vocal tract length, and the articulatory param- 
eters can be estimated simultaneously, Once the 
mean vocal tract length is fixed, only the ar- 
ticulatory parameters are estimated. In Fig. 2-4 
the estimated result for the utterance /aiueo/ 
is shown. 15-17 
2.4 Normalization of articulatory parameters 
The size and shape of each articulatory organ 
are different between talkers, and further the 
manner of articulation varies each other. These 
differences make difficult the talker independent 
speech recognition problem. Therefore, a method 
to be able to adapt for the speaker differences 
is inevitable for speech recognition. 
In the estimated artieulatory parameters in 
the preceding section, the average length of the 
vocal tract is taken into consideration and au- 
tomatically normalized. While the vocal tract 
length is the most significant in this problem, 
other factors still remain. 
vocal tract real estimated 
shape spectrum spectrum 
glottis 
The differences in Japanese 5 vowels between 
adult females are shown in Fig.2-6. In order to 
cancel these differences, a linear transforma- 
tion which changes the distribution of female 
articulatory parameters into that of male ones 
i:s introduced. 
Number of male apoikerll - I0 
XT2 
lul 
III II I I l 
lel 
Ill 
i ; /o/ 
X~ 
Fix.2-$ DLstrlbution for male speakers (A group) 
in XTI-X~2 spacil 
x~ , 
.~.r of fe,.l. .peak.,,. - lO lul ~, ~ 
A 
, ,, -., 
~ ' ') /'/"l, ~ l al ",1 I0-- " ' " ,/ '~I~',~,: '\I , -- I I I I III i 
~~ I tl Ii I 
I I I I I 
i I I I I I I~L---L~ I I I , I I 
u _ Ill 
| | 
- PLB.2-6 Distribut£on for female speakers (b group) 
in XTI-X~2 space 
0- 
9_ 
o 
3O 
kHz ~ frame\]- 
Fig. 2-4 Estimated result 
for the utterance /aiueo/ 
lips 
Nw.ber of female speakers - 10 
XT2 
,% 
/i/ 
Ioi /u/ , 
X~ 
I| 
PiJ,2~7 Distribution for female speakers (b group) 
in XTI-X~2 space after linear transformation 
--467- 
X=AX+B (2-ii) 
wherex=(XT1XT2,Xj,XL) ~ :modified vectors coming ' near the male distribution. 
X=(XTI,XT2,Xj,XL) T :estimation value for 
the female The matrices A and B are caiculated by the 
least squares method. In Fig. 2-5 ,2-6 and 2-7 
distribution for 10 males and females are shown 
respectively. The distribution of Fig.2-6 is 
transformed by the above linear transformation 
into Fig.2-7. It should be noticed that the ma- 
trices A and B were determined by data of i0 fe- 
male who are not contained in the group shown in 
Fig.2-6. The results for the vowel recognition 
experiment are shown in Table 2-2. 
In this experiment, the number of subjects is 
50, that is 30 males and 20 females. Each sub- 
ject uttered every vowel five times. They are 
grouped every ten subjects, then there are three 
male groups A, B, C and two female groups a, b. 
Experiment I ; Recognition of male voices after 
learning by male voices 
Experiment II ; Recognition of female voices 
after learning by female voices 
Experiment HI; Recognition of female voices 
after leaning by male voices 
Experiment ~; Recognition of female voices 
using normalization of linear 
transformation, where the refer- " 
ence is made b Z male voices 
These results show the effectiveness of the 
articulatory parameters for the talker independ- 
ent speech recognition and the normalization 
procedure is useful for the male-female trans- 
formation. 
2.5 Discrimination of continuous vowels by usin~ 
articulatory parameters 
The locus of estimated articulatory parameters 
about /aiueo/ is shown in Fig. 2-8. 
Initially, continuous vowels are discriminated 
frame by frame using a quadratic discrimination 
function of articulatory parameters. As this re- 
Table 2-2 Confusion matrix for the result of vowel recognition 
Experiment I 98.8% Experiment ~I 94.9% 
/i/ 
/u/ 
/e/ 
/o/ 
/a/ /i/ /u/ /e/ /o/ 
/a, 298 - 2 
296 4 - /i/ 
- 300 - - /u/ 
- 2 - 298 /e/ 
9 - 1 - 290 /o/ 
Experiment 1I 98.0% 
I~ /a/ li/ /u/ /el /o/ 
/a/ 
/i/ 
/u/ 
/el 
/el 
99 - 1 
96 4 
99 1 
i00 
2 2 - 96 
lal Ill lul ~el Ioi 
295 - 1 4 
- 286 - 14 - 
1 298 1 - 
2 1 9 288 - 
37 6 257 
Experiment ~ 97.3% 
lal lil lul ~el Iol 
/a/ 289 - Ii 
li/ - 298 - 2 
/u/ - 299 1 
/el - 15 - 285 
/o/ I ii 288 
sults involve transient parts, it is necessary 
to estimate them. Then S(k) is computed as the 
parameter of discriminating them. 
S(k)= I~Tl(k+l) -XTl(k 1 1+ \[XTo(k+ 1)-XTo 
+\]Xj(k+I)-Xj(k)J+JXL~+I)-XL(k~t k) l (2-12) 
It can be regarded the valley of S(k) ~s the 
stable part and the high part of S(k) than the 
threshold t as transient part. But, it is diffi- 
cult to distiguish between stable part and tran- 
sient part enough. So stable part is decided 
with changing threshold t. 
The speakers in this experiment are 2 male a- 
dults. Each speaker spoke 23 kinds of continuous 
5 vowels 2 times. This continuous vowels involve 
all 60 kinds of connected vowel VIV2Va(VI,V2,V3: 
a,i,u,e,o) such as aie, oei. 
Table 2-3 shows the errors occurred by coartic- 
ulation. 
front 
Fig. 2-8 
The locus of articulatory 
parameters 
Table 2- 
\]low 
3 Vowel discrimination 
XT2 
high 
/aiueo/ 
......... i 
errors 
speaker S.M. 
NO.I No. 2 No. 3 
! /aiueol 
2 /aeiou/ 
3 /aeuol/ 
4 /iaoiu/ 5oiu~ouu \[6ao~o 
5 /iuoea/ 5uo--u 
6 lieaou/ 
7 /uaieo/ 17 
8 /uaoel/ lei~ii 7a°~° 8 uaO~uOO ~ei--ii \[ ei~i 
9 /ueaio/ \[9eai~eei 
I0 /uoaeo/ 
II /eiuae/ 2 
12 /euioa/ e ~i 
13 /eoaui/ 
14 /eouei/ 
15 /oiaue/ 
16 /oieua/ 
17 /ouiae/ 
18 /auoua/ 
19 /uieia/ 
20 /eauau/ 
21 /eoioe/ ;ioe~iue 
22 /oaiai/ ~lai~iei 
23 /oeueu/ 
)a~o 
\[Oei~ ii 
20io~ue O 
2~i~ i 
L~ e~ii 22 
iae~iee 
L2 2~ OU~OO auo~ao 
24 \[3 uie~ue 
L~au~ "Ou_~ 
L5 
ueu~ulu 
25 
iai~ iei 
T.S. 
No.4 
26 
ao~a 
28 
uao~uou 
29eai~eel 
3O 
eau~eeu 
Joe ~ lue 31 
--468- 
2.6 Consonant recognition using articulatory 
parameters 
By the data of the gliding section between 
consonant and vowel, consonants are discriminat- 
ed. In this discrimination experiment, a method 
of DPmatching applied for the loci of the artic- 
ulatory parameters which are estimated for the 
gliding section. Simple acoustic parameters are 
used together. 
Speech data in this experiment is uttered by 
2 male adults, each speaker spoke continuous 
VzCV2. (Vz is fixed at /a/, V2 is vowel /a/,/i/, /u/,/e/,/o/, 
C is consonant /g/,/z/,/d/,/b/,/h/, /s/,/p/,/t/,/k/ 
where /adi/,/adu/ were omitted.) 
One data is for reference pattern, another is 
for the test. 
Since the articulatory organs show relatively 
speedy motion in the interval from the consonant 
to the vowel, the frame length is selected as 12 
ESTi~TION OF ~' liATQIING FINkL 
PAR/~TI!I~ 
Fig.2-8 .Block diagra m of the recognition experiment 
Table 2-4 Results of the recognition experiment 
~m/~ /~ /~ 
/ga/ 20 o o o 
2 /za/ 2 z6 o 
/cYa/ o o 2o o 
./hV o o o 2,) 
n' /',ca/ /aa/ /ta/ /pa,' /ha/ 
20 0 0 0 0 
/aa/ 1 17 2 o o 
/tus/ 6 0 13 1 o 
/pa/ \] o 1 ~. 1 
/ha/ 0 1 0 0 19 
/~/ /zi/ /hi/ 
Iqil 18 1 1 
,/~/ 0 2O 0 
/bL/ 0 1 19 
/',ct/ /,,hi/ /c:~./ /pi/ ,44j 
/k:L/ 13 0 2 0 5 
/~ht/ 0 20 0 0 0 
/d~L/ 0 0 20 0 0 
/~/ o : e 20 o 
,q~/ 5 0 0 0 15 
/cS./ /=u/ /bu/ 
loW' 19 1 o 
/~v' 2 18 o 
./Ix*/ 0 0 2O 
/~/ /de/ /,,,,/ 
/ge/ 20 0 0 
/~/ 0 18 2 
/de/ o 1 19 
Foe/ o o l 
/../ 
itsu/ 
ipu/ 
,,~/ ~ 
o ,4~/ 
o ~1el 
o I~I 
19 /pc/ 
/he/ 
~'~ 19ol 17,o/ I~ol /bol 
/go/ 15 0 3 2 
/zo/ 0 17 3 0 
/do/ 0 1 16 3 
~o/ 0 I 4 15 
voiced : 91.4 % 
/k~V /~V /b~u/ /pu/ /m/ 
19 0 0 1 0 
0 16 1 1 0 
1 9 10 0 0 
1 0 0 18 1 
0 0 o 1 19 
/~1 leml ~eel /be/ ~,,el 
19 0 0 0 1 
0 20 0 0 0 
0 2 14 4 ' O 
0 0 1 16 3 
0 0 0 2 18 
/ko/ 1"/ 0 3 o o 
/~:~/ o 2o 0 o o 
/to/ 0 1 17 2 0 
lpo/ 1 o 4 14 I 
I/h o/ 0 0 0 4 16 
unvoiced : 85.4 % 
ms with Hamming Window and the pitch syncronous 
method is used for the analysis. For the estima- 
tion initial value is obtained by the piecewise 
linear estimation method at stable vowel section. 
And articulatory parameters are estimated back- 
ward to consonant section, using the parameters 
at the previous frame as the initial value. The 
process of the discrimination experiment is 
roughly shown in Fig.2-9. 
Initially, we estimate the articulatory param- 
eters of every data by the method mentioned a- 
bove and perform the end-free DP matching for 
the reference pattern of every kind of data for 
each speaker. The total power and the power in 
high frequency are also used for the discrimi- 
nation. For the recognition of the voiced con- 
sonants,the results of the DP matching of artic- 
ulatory parameters are reliable. For that of 
the unvoiced consonants,the acoustic parameters 
are mainly used. These results are shown in 
Table 2-4. 
The attempt to extract the feature of conso- 
nants from the gliding part of the articulatory 
motion in the following vowel was successful to 
some extent. But there are some problems in the 
accuracy in the estimation of the articulatory 
parameters. And the individual difference has 
influence on the parameters. 
3. Word discrimination 
3.i How to input Japanese texts 
Ideally speaking, texts which are arbitrarily 
pronounced should be recognized perfectly and 
suitable Kana-Kanji translation should be per- 
formed. But the phonemic discrimination, the 
processing of synonyms and so on are actually 
very difficult. So the system uses a simple key- 
board as an auxiliary input device, it has only 7 
keys and it is enough manageable by one hand. 
In case of input, first of all, a key indicat- 
ing the kind of the character must be pushed. For 
example, 'H' indicates Hira-gana. For Kanji, the 
reading of the compound word, On-yomi (the phone- 
tical reading of Kanji) and Kun-yomi (the Japa- 
nese rendering of Kanji) constructing the word 
are pronounced. Since the acoustic input is much 
easier than the other input techniques, some 
redundant data can be inputted with less burden. 
In the current version, a key indicating the 
kind of the reading pronounced is pushed for the 
sake of the easy processing. 
For Hira-gana, a speaker may pronounce a word 
which contains its reading as the ending. Other- 
wise a user just pronounces Japanese texts as a 
way of speaking with the control information 
from the keyboard. 
control part t main control character 
\[sub control character input data < 
phonemic partl ph°nemic data 
~phonemic probability 
Fig 3-i the component of input data 
to the word discrimination part 
--469-- 
3.2 Input data 
The component of input data to the word dis- 
crimination part is indicated in Fig.3-1. 
The phonemic part is an acoustic data which 
is recognized in the phoneme discrimination part. 
If it cannot decide only one phonemic candidate, 
the phoneme and its probability for each candi- 
date are passed as a lattice. That is to say, 
input data is a lattice of Japanese pseudo- 
phonemes corresponding to the acoustic input. 
Fig.3-2 is an example of input data. 
The control part is inputted from the keyboard 
in order to correct the phonemic data and perform 
effective Kana-Kanji translation. 
letter 
reading 
meaning 
tensai wa kyuju-kyu paasento no ase dearu 
Genius is ninety-nine per cent perspiration. 
data C J (tensai) T TENSAI 
Y (ten) T TEN , (area) T A~ 
Y (sai) T SAI , (toshi) T TOSHI 
HH Y (wa) T WA 
S Y (ku) T KU 
Y (ku) T KU 
K Y (paasento) T PAASENTO 
H Y (no) T NO 
fi Y (ase) T ASE , (Ran) T KAN 
H Y (dearu) T DEARU 
* (ten) --- t8p6k3 e7al n7m7 
Fig 342 Example of an input data 
parameters as follows are got 
S ; sequence of vowels 
I N ; number of vowels 
LR ; reliable phonemes 
get the restricted dictionary(gL) 
ith N ~ 
riet BL with SJ F. :ou. r 
"'calculate the similarity for the BL items I 
satisfying the R restriction \[\[ 
Ithen,some of the most similar items,CA arel\] 
\[selected as candidates ___~l 
L ; boundary of similarity 
(given) 
Fig 3~3 Algorithm to decide some candidates 
3.3 Word discrimination algorithm 
For Kanji, redundant data, which consists of 
a reading of a compound word, On-yomi, and Kun- 
yomi, are given. Some candidates of Kanji corre- 
sponding to each reading are sought using the 
algorithm of Fig.3-3. This algorithm depends on 
the fact that vowels can be discriminated more 
precisely than consonants, and the tendency which 
one phoneme is apt to be misdiscriminated from 
others are statistically known previously. 
Kanji dictionary used contains about i000 
readings of Kanji and corresponding characters. 
This dictionary is divided by the number of 
vowels and further indexed by the sequence of 
them. A part of it is shown in Fig.3-4. 
_reading of KAN3I \] dictionary 
nt~nber of vowels 
.. 
vowo,  
°° °°. 
items 
.... • '' 
Fig 3-4 Example of a dictionary 
Similarity between a discriminated phonemic 
string and a word contained in Kanji dictionary 
is computed by dynamic programming using the 
confusion matrix. This matrix is made using the 
results of the recognition experiments and con- 
tains the misdiscrimination probability to other 
phonemes, the probability of omission and that 
of addition of other phonemes. Among these, the 
probability of addition depends upon the tendency 
that the addition is influenced by the following 
phoneme as tile nature of the phoneme discrimi- 
nation part. 
In dynamic programming, it is assumed that the 
number of the continuous omission or addition of 
phonemes is less than two. The similarity S(D,W) 
between the phonemic string D whose length is I, 
ans W whose length is J is calculated as follows: 
s(D,w) = g(1,1)/i 
g(i,j) = max{ Ll(i,j),La(i,j),Lo(i,j)} 
Ll(i,j) = sim(i,j)+g(i+l,j+l) 
La(i,j ) = sim(i,j+l)+g(i+l,j+2)+adp(i,j) 
Lo(i,j) = sim(i+l, j) +g (i+2, j +I) +omp (j) 
g(l,j) = sup (J-j) 
g(i,J) = sup (I-i) 
where sim(i,j) is the similarity between phoneme 
i and j, and given from the confusion matrix, 
adp(i,j) is the probability indicating that pho- 
neme i is added before phoneme j, omp(i) is 
the probability omitting i, and sup(i) is the 
penalty value of unmatched length. 
This algorithm selects some candidates of Kanji 
which have a high matching score. Since some 
redundant data is given for each kanji, one most 
suitable character is selected and printed. 
--470 
For Hira-gana, the system doesn't have redun- 
dant readings originally in contrast with Kanji 
As it is especially difficult to discriminate 
too short phonemic strings, phonemic corrections 
are neccessary. So the system utilizes the nature 
of Japanese that almost all short Hira-gana 
strings are found as the inflectional part or 
Joshi (auxiliary word in Japanese). For the 
former, a user may pronounce a word which con- 
tains the inflection and the system uses them as 
a redundant data using the inflection dictionary. 
For the latter, the dictionary containing Joshi 
and its connection information is used as a priori 
information. 
In case of Kata-kana, as the usage is restricted 
to the words of foreign origin, the phonemic 
correction is performed using the loanword 
dictionary. At this time, the algorithm of 
Fig.3-2 is available. 
Besides, Japanese text input system must 
process special characters, for example numbers, 
alphabets, punctuation marks and so on. But they 
are a few, so the above algorithm can processed 
them using special character table. 
4. Conclusion 
At present, the system uses YHP-21MX and the 
special purpose hardware for the phonemic data 
extraction, and the phoneme and the word discri- 
mination part are built on HITAC 8800/8700 at the 
Computing Centre of the University of Tokyo. 
Though the performance, the facility and the 
bottle-neck as the total system has not been 
clarified as yet, partially some results and 
problems have been found. 
The phoneme discrimination part aims at recog- 
nizing connected speech uttered by an unspecified 
speaker. On behalf of it, the feature extraction 
method at the articulatory level is tried. As 
a result, stable articulatory motion is estimated 
with the satisfactory precision for vowels, and 
the validity of it is confirmed. And the adapta- 
tion for speakers is effectively performed. For 
consonants, the feature extraction method using 
the transient part of the articulatory motion 
is tried, but it has a little drawpoint at the 
precision and the stability of articulatory 
parameters. It is also influenced by a speaker. 
A combination to the other sound parameters will 
perhaps make the system improve. 
For the word discrimination, the syntax has 
not been used except for Hira-gana at this 
version. Hereafter it is planned to incorporate 
the syntactic information as much as possible 
for the improvement of the performance. But the 
trade-off between the facility and the processing 
speed is an important problem. 
Acknowledgment 
The authors wish to thank J.Kubota, T. Kobayashi 
and M.Ohashi for their contributions to designing 
and developing this system. 

References 

(i) Lea,W.A. Ed.'Trends in Speech Recognition' 
Prentice-Hall (1980) 

(2) Ready,D.R. Ed.'Speech Recognition' Academic 
Press (1975) 

(3) Newell,A. et al.'Speech Understanding 
System: Final Report of a Study Group' 
North-Holland (1973) 

(4) Denes,P.'The Design and Operation of the 
Mechanical Speech Recognition at University 
College London' Jour. Brit. I.R.E. Vol.19 
No.4 pp.219-234 (1959) 

(5) Woods,W.A.'Motivation and Overview of 
SPEECHLIS: An experimental Prototype for 
Speech Understanding Research' IEEE Trans. 
A.S.S.P. VoI.ASSP-23 pp.2 (1975) 

(6) Sakai,T.,Nakagawa,S.'A Classification Method 
of Spoken Words in Continuous Speech for Many 
Speakers' Jour. Inf. Proc. Soci. of Japan 
Vol.17 No.7 pp.650-658 (1976) 

(7)Shirai,K.'Feature extraction and sentence 
recognition algorithm in speech input system' 
4th Int. J. Conf. on Artificial Intelligence, 
506-513, 1975. 

(8)Wakita,H.'Direct estimation of the vocal tract 
shape by inverse filtering of acoustic speech 
waveform, IEEE Trans., ASSP-23, 574-580, 1975. 

(9)Nakajima,T.,Omura,T.,Tanaka,H.,Ishizaki,S. 
'Estimation of vocal tract area functions by 
adaptive inverse filtering methods' Bull. 
Electrotech. Lab., 37, 467-481, 1973. 

(10)Stevens,K.N., House,A.S., 'Development of a 
quantitative description of vowel articula- 
tion',JASA, 27, 484-493, 1955. 

(ll)Coker,C.H.,Fujimura,O.,'Model for specifica- 
tion of the vocal-tract area function',JASA, 
40, 1271, 1966. 

(12)Lindblom,B.,Sunberg,J.'Acoustical conse- 
quences of lip, tongue, jaw and larynx move- 
ment' JASA, 50, 1166-1179,1971. 

(13)Shirai,K.,H~nda,M.,,An articulatory model 
and the estimation of articulatory parame- 
ters by nonlinear regression method' Trans. 
IECE, Vol. J59-A, No.8, 668-674, 1976. 

(14)Atal,B.S.,Rabiner,L.R,,'A pattern recogni- 
tion approach to voiced-unvoiced-silence 
classification with applications to speech 
recognition' IE~E Transo, ASSP-24, 201-212, 
1976. 

(15)Shirai,K.,Honda,M.,'Estimation of articula- 
tory parameters from speech waves' Trans. 
IECE, Vol.61-A, No.5, 409-416,1978. 

(16)Shirai,K.,Honda,M.,'Feature extraction for 
speech recognition based on articulatory 
model' Proc. of 4th Int. J. Conf. on Pattern 
Recognition 1978. 

(17)Shirai,K.,Matzui,T.,'Estimation of articula- 
tory states from nasal sounds' Trans. IECE, 
Vol. J63-A, No.2, 1980. 
