Interactive Speech Understanding 
Iliroaki Saito 
Dept. of Mathematties 
Keio University 
Yokohama, 223, JAPAN 
E-mail: hxs@nak.math.keio.ae.j p 
Abstract 
This paper introduces at robust interactive 
method for speech understatnding. The gener- 
atlized LR patrsing is enhanced ill this approach. 
Patrsing proceeds fl'om left to right correcting mi- 
nor errors. When at very noisy portion is detected, 
the patrser skips that portion using a .fake non- 
terminal symbol. The unidentified portion is re- 
solved by re-utterance of thatt portion which is 
parsed very efliciently by using the parse record 
of the first utterance. The user does not have to 
speak the whole sentence again. This method is 
also catpatble of hatndling unknown words, which 
is imlmrtatnt in pra.ctical systems. 1)erected un- 
known words earn I)e incrementatlly incorporatted 
into the dictionary after the interatction with tile 
user. A pilot system has shown great elfectiveness 
of this atpproach. 
1 Introduction 
It has been continuously mentioned thatt some 
kind of latnguage knowledge is essential in good- 
quality speech understanding. Until recently, 
however, most research has focused mainly oil 
word recognition atnd one of the excellent recogni- 
tion systems built to date is Sphinx developed by 
Lee \[7\]. Although SI)hinx atttained atn excellent 
word accuracy of 96 % on at 997-word task, its 
sentence recognition accuracy drops slgnificatntly 
clue to its use of only at stattisticaJ trigra~l gratm- 
iilal'. 
There hatve been at few atttempts to integratte at 
speech recognition device with a nattural language 
understanding syste,n, ltatyes el al. \[3\] adopted 
technique of case fi'ame instantiation to patrse at 
continuously spoken English sentence in the form 
of at word lattice (a set of word catndldattes hy- 
pothesized by at st)eech recognition module) and 
produce at frame representation of the utterance. 
The case frame patrsing hats been pursued by Poe- 
sio et al. \[8\] and Giatchin et al. \[2\] for instance. 
Meanwhile, at compiler-oriented shift-reduce 
LR parsing technique hats been used for speech 
recognition recently due to its no-batcktracking 
tatl)le-drlven ei\[iciency \[12, iII, 6\]. Becatuse the 
parsing proceeds from left to right pruning low- 
l)robatl)ility t)atrtiatl-parses, the correct parse catn 
not be obtained if the parsing fails to find the 
correct path in the beginning. Moreover, it is 
sometimes difficult to handle tim very noisy input, 
esl)ecially the input with missing words. Thus an 
Lll. parser sometimes yields totally incorrect but 
syntactically-sound hypotheses or no hypotheses 
att all. This weakness is occasionally cited to 
demonstrate superiority of the pa.rsing method 
nsing much simI)ler bigram or trigratm grammars 
in which the re.covery in the middle of the in- 
put earn be done at eatse. In this paper, we de- 
scribe at method of enllatncing the generalized 1,R 
(GLR) parsing towatrds interactive speech under- 
standing. 
Section 2 describes the enhatnced GLR parrs- 
lug. Section 3 describes the rol)ustness of the 
parser and presents an interatctive method to re- 
solve the unclcatr I)ortion of the input and un- 
known words. Section 4 experiments the effec- 
tiveness of the technique in parsing spoken sen- 
tences. Finally the concluding rematrks atre given 
in Section 5. 
2 Enhanced GLR Parsing for 
Speech Understanding 
Ill this section, tile GI,R patrsing method is de- 
scribed first. Then some techniques which en- 
hatnce the robustness are described. 
AcrEs DE COLING-92. NANTES, 23.28 ^ot~'r 1992 1 0 S 3 Paoc. OF COLING-92, NAtcrras. Aoo. 23-28, 1992 
2.1 Background: GLR Parsing 
The LR parsing technique was originally devel- 
oped for the compilers of programming languages 
\[1\] and has been extended for natural language 
processing \[11\]. The GLI\[ parsing analyzes the in- 
put sequence from left to right with no backtrack- 
ing by looking at the parsing table constructed 
from the context-flee grammar rules in advance. 
An example grammar and its parsing table are 
shown in Figure l and Figure 2 respectively. 
Entries "s n" in the action table (the left part 
of the table) indicate the action "shift one word 
from the input 1)uffcr onto the stack and go to 
state n". Entries "r n" indicate tile action "re- 
duce constituents on the stack usiug rule n". The 
entry "ace" stands for the action "accept", and 
t)lank spaces represent "error". "$" in the action 
table is the end-of-inl)ut symbol. The goto table 
(the right part of the table) decides to which state 
the parser shouhl go after a reduce action. The 
LR parsing table in Figure 2 is different fi'om reg- 
ular LR tables utilized by the compilers in that 
there are multiple entries, called conflicts, on the 
rows of state 11 and 12. While the encountered 
entry has only one aztion, parsing proceeds ex- 
actly the same way as the regular LR parsing. 
In case there are multiple actions in an entry, all 
the actions are executed with the graph-structured 
stack \[11\]. 
(1) S --> NP VP 
(2) S --> S PP 
(3) NP --> n 
(4) NP --> det n 
(5) NP --> NP PP 
(6) PP --> prep NP 
(7) VP --> v NP 
Figure 1: Example CFG Rules 
2.2 GLR Parsing for Erroneous Sen- 
tences 
The original GLR parsing method was not de- 
signed to handle ungrammatical sentences. This 
feature is acceptable if the domain is strictly de- 
fined and input sentences are correct at all times. 
Unfortunately, accuracy of speech recognition is 
not 100%. Common errors in speech recognition 
are insertions, deletions (missing words), and sub- 
<Action Table> I <Goto Table> 
det n v prep $ I NP PP VP S 
....................................... 
s3 s4 
s6 
s7 s6 
slO 
0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
tl 
12 
s3 s4 
s3 s4 
r3 r3 
r2 r2 
rl rl 
r5 r5 r5 
r4 r4 r4 
r6 r6,s6 r6 
rT,s6 r7 
2 1 
ace 5 
9 8 
r3 
11 
12 
9 
9 
Figure 2: GI, R Parsing Table 
stitntions. Some techniques have been developed 
to handle erroneous sentences for the GLR pars- 
ing \[12, 10\]. 
• The action table can be looked up in a 
predictive way to handle a missing word. 
Namely, a set of possible terminal symbols 
{Ti} at State i can be missing word candi- 
dates. 
• This way of using the action table is also use- 
ful to handle substitution and insertion er- 
rors. I.e., the table can tell which part of the 
input should be replaced by a specific symbol 
or ignored. 
'\['he parser explores every possibility in 
paralleP. 
2.3 Gap-filling Technique 
The techniques described in tile previous section 
can not handle such a big noise as two consecutive 
missing words. To cope with this, the gap-filling 
technique \[9\] is presented here. 
In tile gap-filling GLR parsing, the goto table 
is consulted just the same way as the action table, 
in addition to its regular usage. Namely, at state 
si which is expecting shift action(s), the parser 
also consults the gore table. If an entry m ex- 
ists along the row of state sl under the column 
lie practice, pruning is incorporated to reduce search 
by using the likelihood attached to each word in the speech 
hy potheses, 
ACIT~ DE COLING-92. NANTES. 23-28 ^OI~T 1992 1 0 5 4 PROC. OF COL1NG-92. NAN'r~s, AU6.23-28, 1992 
labeled with nontel'nlinM 1), the parser shifts D 
onto the stack an(l goes to state m. Note that no .:zINP\] -~ 
input is scanned when this action is performed. ..~):\]~-+~ 
When the input is in<:omplete, the parser pro 0% 
\",. wo cut Bad duces hyl)otheses with a fake nonterminal at tile +"+ ~n .+~v". adj 
noisy position. ::, ...." +".... 
We show an example of l)arsing an incorrectly NO 2 Wp \] -B INP\] ~2 
recognized sentence "we cut sad with a kuife" us- 
ing the grammar in I:igure 12 and the LI¢ table 
in Figure 2. :~ At the initial state 0, the got() ta- 
Ill( +, tells that the nonterminals N P and S can I>e 
shifte(1 using the gap-filling technique. Although 
the first wor<t "we" (noun) is expected at state 
0, these fake+ nonterminals are ere+areal (\]"igure 3) 
in ca+se "we" is an incorrectly recognized word. 
Tile new states for the fake tlonterminals NP and 
S are 2 and 1, resi>ectively, q'he goto table tells 
that fake nonterminals PP and VP can be place(\[ 
at state 2. In this case, however, we do not create 
these nonterlltinals~ l)eeause two fake l/oIltel'llti- 
nals r+u'ely need to I>e I)\[a(:e<\] adjacently in prac 
|ice. No further fake nonterminal is att, a.ched to ii i 
tile fake nonterminal S for the same reason. ~! v+, 
/\[NP) 2 
O% 
'",, we OOl had with a knilo 
n v ad i prop ell n 
l:igure 3: I'arse Trace 
Iu parsing the third word "sad", a fake nonter- 
minal \[NP\] to word "cut" keeps the correct path 
(Figure ,1). 
l'arsing continues in this way and the linal situ- 
ation is shown in Figure 5. As a result, the parser 
tinds two snccessfifl parses: 
(n (v (\[NP\] (prep (det n))))) 
((n (v \[NP\])) (prep (de| n))) 
Namely, the \])arser Jinds <rot that the third 
word is incorrect and must be the word(s) in NP 
category. 
2'J'he terminal symbols of this grammar are grammati- 
cal category names called prcterminals. A lexicon should 
be prepared to map all actual word to its pretcrmina\]. 
:~'l'hc techniques in the previous section arc enough for 
parsing this erroneous se/dencc. We use this eXaml>le only 
for describing Ihe gal~ |iliing techJ,illue. 
with a knMe 
pe~ ~1 n 
Figure 4: Parse Trace (cont'd) 
+,"\[NPl 
~";',. ~ eel ~¢+ wi~h / . 
'\ n ," v ~-, ad I .,' I)t ~+*;~ ", dol 
~i! ".. \ -- k~ 
!!! \' VP 
k.m 
n io 
i1 
S 
.r" ,,," 
1 
pp 
Figure 5: Parse Trace (conq)lete) 
3 Interactive Speech Under- 
standing 
In this section, the rot)tLstness <)f tlw (;LR parser 
with various error-recovery techniques (esl)ecia.lly 
the gap-filling te(:htdque) aga.inst a noisy input is 
described. Then an interactive way to resolve the 
unidentified portion is I(reseld.ed. 
3.1 Resolving Unidentified Portion 
The gap-filling teehniqtm enhances the robustness 
of the (HAl parsing in handling a noisy int)ut as 
folk>ws: 
• A fake nonterminals fills big missing con- 
stituents of the input which would yiehl no 
hylmtheses without the gap-.tilling func+tion. 
* The gap+filling fiHtction enables an LR parser 
to perform reduce actions only when the ac- 
tion creates a definite high-score nontermi- 
hal. The fake nonterminal is likely to I)e ci- 
tllel +111 illSel'ti<')ii of all tlllkllo~,vii word, 
ACI'ES DE COLING-92, NANTES, 23-28 AO(ff 1992 1 0 5 5 PROC. Ol: COLING-92, NANTES. AUG. 2,3+28, 1992 
A gap filled with a fake nonterminal can be 
resolved by reanalysis of the input under the 
constraint that that portion of the input should 
yield the specific nonterminal. This top-down re- 
analysis would be effective against the genuinely 
bottom-up GLR parsing. In practice, however, a 
more reliable way is to ask the user to speak only 
the missed portion. In the previous example, only 
the portion of \[NP\] shouhl lie st)oken again. 
The parser can analyze the re-utterance effi- 
ciently ,as follows: 
1. The parser keeps the parse record of the first 
input. 
2. The parser starts parsing the new input just 
where the fake nonterminal was created. 
3. The parsing ends when tim same-name real 
nonterminal symbol is created out of the re- 
utterance. 
3.2 Handling Unknown Words 
If the reutterance cau not be parsed correctly 
even by the reutteraime, the unidentified portion 
is likely to contain an unknown word. Finding an 
unknown word by a specific nonterminal symbol 
enables the interactive grammar augmentation as 
the following, for instance. 
The parser can not identily the ~ollowing 
portion of your input. 
We cut \[NP\] with a knife 
If this is a new word in the category of \[NP\] 
a rule 
NP --> (recog. result of the 2nd utterance) 
will be added to the grammar. Is this ok? 
Handling unknown words is important in natu- 
ral language processing. For example, Kainioka et 
al. \[5\] proposed a mechanisnl which parses a sen- 
tencc with unknown words nsing Delinite C, lause 
Gralumars. The efficient gap-filling technique of 
handling unknown words is quite useful in prac- 
tical systems and enhances the robustness of the 
GLR parsing greatly. 
When an unknown word W,,~, is detected, the 
word should be incorporated into the system. If 
the grammar is separated from the lexicon, the 
word can be easily added to the dictionary. If 
the grammar contains the lexicon, the LR table 
should be augmented incrementally in the follow- 
ing way. 
1. For each state si which has an entry under 
the column of the nonterminal D($~k,) in the 
goto table, add shift action "s m" (m is the 
new state number) for W .... (If Wnew con- 
sists of such multiple words as "get rid of', a 
new state should be created for each element 
of the words. ) 
2. Add reduce action "r p" (p is the new rule 
number) for all the terminals on the row of 
state nl. 
Before we close this section, wc should consider 
side etfects of the gap-tilling technique. It is true 
that putting fake nonterminals expands search. 
Thus, some side effect might appear if the accu- 
racy of input is not good. Namely, input should 
be good enough to produce distinct fake nonter- 
minals and real nonterminals. Although it is dif- 
ficult to analyze this phenomenon theoretically, 
the following natural heuristics can minimize the 
search growth. 
o Two consecutive fake uontermiuals are not 
allowed as shown in the previous section. 
• When a word (Wi) can be shifted to both 
a fake nonternfinal D.fake and a same-name 
real nonterminal D~e,z, only D~,t should be 
valid. 
• When D:,~ and D,.~l (:an be bundled using 
the local ambiguity packing \[111 tecbnique, 
discard l) f (,k,,. 
4 Experiments: Parsing Spo- 
ken Sentences 
We evaluated effectiveness of tlle enhanced GLR 
parsing by spoken input. We used a device which 
recognizes a .lapanese utterance and produces its 
phoneme sequence \[4\]. The parser we used is 
1)ased on the (-HA/ parser exploring the possi- 
bilities of substituted/inserted/deleted phonemes 
\[10\] by looking up the eonfilsion mntrix, which 
was constructed from the large vocabulary data. 
The confusion matrix is also used to mssign the 
score to each explored phoneme, because the 
recogldtion device gives neither the alternative 
phoneme candidates nor the likelihood of hypoth- 
esized phonemes. The gap-filling fimction is in- 
corporated iuto the parser in the following experi- 
ments. Parsing a l>honeme seqnence might sound 
less pot>ular than I)arsing a word lattice in speech 
AcrEs DE COUNG-92, NANTES, 23-28 Ao(:r 1992 1 0 S 6 PROC. OF COLING-92, NANTES, AU6.23-28, 1992 
recognition. Because the parser builds a lattice 
dynamically in parsing the sequence from left to 
right using a CFG which contains the dictionary, 
no static lattice is necessary. 
125 sentences (five speakers pronounced 25 sen- 
tences) were tested in tim domain called "conver- 
sation between doctors and patients." 111 sen- 
tences were parsed correctly \[88.8 %\] (the correct 
sentence was obtained as the top-scored hypoth- 
esis). 14 failed sentences can be classified into 
three groups: 
(i) 4 sentences were parsed as the top-scored 
hypotlmses with fake nonterminals. Thus the 
parser asked the user to speak the unidentitied 
portion again. 
(ii) 6 sentences were parsed incorrectly in that 
the correct sentence did not get the highest score 
mainly because the incorrect nonterminal had a 
slightly higher score than the correct one. In this 
case, both the closely-scored correct and incof 
rect nontermin~s are packed into one nouterllli- 
nal using the local ambiguity packing technique in 
an efficient implementation. In this situation the 
parser should ask the user to speak only that un- 
clear portion in the same way as in (i) instead of 
producing a barely top-scored hypothesis. In the 
current implementation the parser asks the user 
which word is the correct one. 
(iii) 4 sentences were pronounced very I)adly. 
The user has to speak the whole sentence again. 
5 sentences with unknown words were also 
tested, in all eases, the unknown word was de- 
tected. 
This result shows that interactive partial re- 
utterance is very effective both for error-recovery 
and for detection of unknown words. 
5 Concluding Remarks 
We presented a robust interactive apl)roach for 
speech understanding. The GLR parsing method 
WaS enllaliced to recover errors and to skip a very 
noisy portion. These techniques remedy ",dl-or- 
nothing-imss of the CF(Lbased LR t)arsing. The 
skipped portion is represented by a fake non- 
termimd which is resolved l)y re-utterance. An 
unknown word is also detected by a fake non- 
terminal and is incorporated into the dictionary 
incrementally through interaction with the user. 
Exl)eriments in t)arsing a Jal)anese l)honeme se- 
quence have shown a great effectiveness of this 
interactive approach. 
References 
\[1\] Sethi R. Aho, A. V. and J. D. Ullman. Compilers. 
Addison Wesley, 1986. 
\[2\] E. Giachin and C. l{,ullent. Robust Pars- 
ing of Severely Corrupted Spoken Utterances. 
In Proceedings, 12th lnte~mational Conference 
on Computational Linguistics (COLING), Bu- 
dapest, ltungary, August 1988. 
\[3\] llauptmmm A. G. Carbonell J. G. Hayes, P. J. 
and M. Tomita. Parsing spoken language: A 
semantic caseframc approach. In Proceedings, 
l lth lnleT~aalional Conference on Computational 
Linguistics (COLING), West Germany, August 
1986. Bonn. 
\[4\] Morii S. lloshimi M. Hiraoka, S. and K. Niyada. 
Compact isolated word recognition system for 
large vocabulary. In Proceedings, IEEF-IEC'EJ- 
ASJ International Conference ou Acoustics, 
Speech, and Signal Processin 9 (ICASSP), Tokyo, 
April 1986. 
\[5\] T. Kamioka and Y. Anzai. Analysis of sentences 
including unknown words by hypothesis genera- 
tion mechanism. Journal of Japanese Society for 
Artificial Intelligence, Vol.3 No.5, pages 627-638, 
September 1!)88. \[In Japanese\]. 
\[0\] Kawabat, a T. Kits, K. and I1. Saito. tlMM Con- 
tinuous Speech Recognition Using Predictive LR 
Parsing. In ICASSP, May 1989. 
\[7\] K.F. Lee. Large- Vocabulary Speaker-htdependenl 
Continuous Speech Recognition: The SPHINX 
System. Phi) thesis, Computer Science Depart- 
ment, Carnegie Mellon Uniw;rsity, April 1988. 
\[8\] M. Poesio and C. lhlllent. Modified Cm'~eframe 
Parsing for Speech Understanding Systems. In 
Proceedings, lOth International Joint Conference 
on Artificial Intelligence (IJCAI), Milan, August 
1987. 
\[9\] II. Saito. Gap-tilling LR Parsing for Noisy Spoken 
Input: "Ibwards Interactive Speech Hx~eognition. 
In Proceedings, httcTmattonal Conference on Spo- 
ken Language Processing (ICSLP), Kobe, Japan, 
November 1990. 
\[10\] II. Saito and M. Tomita. Parsing Noisy Sen- 
tences. In Proceedings, 12th International Con- 
ference on G'omputalional Linguistics (COL- 
IN(;), Budapest, lhmgary, August 1988. 
\[11\] M. Tomita. l';Jlieient Parsing for Natural Lan. 
guage, l(luwer Academic Publishers, Boston, 
MA, 1985. 
M. qbmita. An efficient word lattice parsing algo- 
rithm for continuous speech recognition. In Pro- 
ceedin.qs, IEEE-1ECEJ-ASJ htternational Con- 
ference on Acoustics, Speech, and Signal Process- 
ing (ICASSP), 2bkyo, April 1986. 
\[12\] 
ACRES DE COLING-92, NANTES, 23-28 AO(rr 1992 1 0 5 7 PRoc. OF COL1NG-92, NANTES, AUO. 23-28, 1992 
