A Context-Sensitive Model for Probabilistic LR Parsing of Spoken
Language with Transformation-Based Postprocessing
Tobias Ruland, Siemens AG, ZT IK 5, D-81730 München
Tel.: +49-173-369 30 67, Fax: +49-89-929 54 54, Tobias.Ruland@web.de 
Abstract 
This paper describes a hybrid approach to
spontaneous speech parsing. The implemented
parser uses an extended probabilistic LR parsing
model with rich context, and its output is post-
processed by a symbolic tree transformation routine
that tries to eliminate systematic errors of the
parser. The parser has been trained for three
different languages and was successfully integrated
into the Verbmobil speech-to-speech translation
system. The parser achieves more than 90%/90%
labeled precision/recall on parsed Verbmobil
utterances, while 3% of the German and 5% of the
English input cannot be parsed.
1 Introduction 
Verbmobil (Wahlster, 1993) is a spontaneous 
speech-to-speech translation system and translates 
spoken German to English/Japanese and vice versa. 
The main domains are "appointment scheduling"
and "travel planning". There are several parallel
analysis and translation modules in Verbmobil, as
described in (Ruland et al., 1998), and one of these
analysis modules is the probabilistic parser
described in this paper. A schematic diagram of the
Verbmobil system architecture is shown in figure 1.
The input to the Verbmobil speaker-independent
speech recognizers is spontaneously spoken
German (vocabulary: 10,254 word forms), English
(7,534 word forms) and Japanese (2,848 word
forms). The output of the speech recognizers and
the prosody module is a prosodically annotated
word graph. This word graph is sent to the 
Integrated Processing module which controls the 
three parsers (HPSG parser (Kiefer et al., 1999), 
chunk parser (Abney, 1991) and our probabilistic 
parser) of the "deep" (semantics-based) translation
branch of Verbmobil. Our probabilistic parser is a
shift-reduce parser and uses an A* search to find
the best-scored path in the lattice that can be parsed
by its context-free grammar. The output of the
parser is the best-scored context-free analysis for
this path. This syntax tree is passed to a
transformation unit that corrects known systematic
errors of the probabilistic parser, mapping erroneous
trees to correct ones.
The result of this process is passed to a semantics 
construction module and processed by the other 
modules of the deep translation branch as shown in 
figure 1. 
2 Spontaneous Speech Parsing 
The Integrated Processing unit uses the acoustic
scores of the word hypotheses in the word graph
and a statistical trigram model to guide all
connected parsers through the lattice using an A*
search algorithm. This is similar to the work
presented by (Schmid, 1994) and (Kompe et al.,
1997). The A* search algorithm is used by the
probabilistic shift-reduce parser (see section 3) to
find the best-scored path through the word graph
according to the acoustic and language model
information. If the parser runs into a syntactic "dead
end" in the word graph (that is, a path that cannot be
analyzed by the context-free grammar of the shift-
reduce parser), it searches for the best-scored
alternative path in the word graph that can be
parsed with the context-free grammar.
We extracted context-free grammars for German,
English and Japanese from the Verbmobil treebank
(German: 25,881 trees; English: 23,140 trees;
Japanese: 4,534 trees) to be able to parse
spontaneous utterances. The treebanks consist of
annotated transliterations of face-to-face dialogs in
the Verbmobil domains and contain utterances like

• and then well you you you have hotel information
• no I am not how about what about Tuesday the sixteenth
• actually it yeah so seven hour flight

The grammar of the parser covers only those
spontaneous speech phenomena that are contained
in the treebanks.
During the development of the parser we
encountered severe problems with the size of the
context-free grammar extracted from the treebanks.
The German grammar extracted from a treebank
containing 20,000 trees resulted in an LALR parsing
table with more than 3,000,000 entries, which
cannot be trained on only 20,000 utterances. The
reason was that there are many rules in the
treebank which occur only once or twice but inflate
the context-free grammar and thus the size of the
parsing table.
[Figure 1: Schematic overview of the Verbmobil system architecture with its parallel analysis and translation branches, including example-based translation.]
For this reason we eliminate from our training
material those trees containing rules that occur only
infrequently in the treebank and use only rules that
reach a minimal rule count. This threshold is
determined experimentally in our training process;
a minimal sketch of this filtering step follows.
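As a concrete illustration of this filtering step, consider the following minimal sketch (our own, not part of the Verbmobil implementation; it assumes SWI-Prolog and a hypothetical database rule/1 with one fact per rule occurrence in the treebank):

% rule(R): one fact per occurrence of a treebank rule R,
% e.g. rule((np --> [det, n])) for every NP -> DET N instance.
% keep_rules(+Threshold, -Rules): rules occurring at least Threshold times.
keep_rules(Threshold, Rules) :-
    findall(R, rule(R), All),
    msort(All, Sorted),            % identical rules become adjacent
    count_runs(Sorted, Counted),   % list of Rule-Count pairs
    findall(Rule, (member(Rule-C, Counted), C >= Threshold), Rules).

count_runs([], []).
count_runs([R|Rs], [R-N|Pairs]) :-
    take_run(R, Rs, Rest, 1, N),
    count_runs(Rest, Pairs).

take_run(R, [R|Rs], Rest, N0, N) :- !, N1 is N0 + 1, take_run(R, Rs, Rest, N1, N).
take_run(_, Rs, Rs, N, N).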
3 A new context-sensitive approach to
probabilistic shift-reduce parsing
The work of Siemens in Verbmobil phase 1 showed 
that a combination of shift-reduce and unification- 
based parsing of word graphs works well on 
spontaneous speech but is not very robust on low- 
word-accuracy input (the word error rate of the 
Verbmobil speech recognizers is about 25% today). 
One way to gain a higher degree of robustness is to 
use a context-free grammar instead of a
unification-based grammar; hence we decided to
implement and test a context-free probabilistic
LALR parser in Verbmobil phase 2.
3.1. Previous approaches 
There are several approaches (see for example
(Wright & Wrigley, 1991), (Briscoe & Carroll, 
1993/1996), (Lavie, 1996) or (Inui et al., 1997)) to 
probabilistic shift-reduce parsing but only Lavie's 
parser, whose probabilistic model is very similar to 
(Briscoe & Carroll, 1993), has been tested on 
spontaneously spoken utterances. 
While the model presented by (Wright & 
Wrigley, 1991) was equivalent to the standard 
PCFG (probabilistic context-free grammar, see 
(Charniak, 1993)) model, which is not context- 
sensitive and thus has certain limitations in the 
precision that it can achieve, later work tried to
introduce slight context-sensitivity (e.g. the
probability of a shift/reduce action in Briscoe and
Carroll's model depends on the current and
succeeding LR parser states and the look-ahead
symbol).
3.2. Bringing context to probabilistic shift-
reduce parsing
Like other work on probabilistic parsing, our model
is based on the equation

P(T|W) ∝ P(T) · P(W|T) ,    (2)

where T is the analysis of a word sequence W, and a
widely used approximation for P(W|T) is given by

P(W|T) ≈ ∏_{w_i ∈ W} P(w_i | l_i) ,    (3)

where l_i is the part-of-speech tag for word w_i in
analysis T.
Finding a realistic approximation for P(T) is very
difficult but important for achieving high parsing
accuracy. Suppose we approximate P(W|T) by
equation (3). Then P(W|T) is nothing more than
P(W|L), where L is the part-of-speech tag sequence
for a given utterance W. If our goal is to select the
best analysis T for a given tag sequence L, we do not
necessarily depend on a good approximation of
P(T), but can simply select the best analysis for a
given L by finding a T that maximizes P(T|L) (and
not P(T)). Hence, in our model we use P(T|L)
instead of P(T), so that

T_best = argmax_k P(T_k | L) ,    (4)

where the T_k are the possible analyses for L. Let D
be the set of all complete shift-reduce parser action
sequences for L, i.e. d_k is the sequence of shift and
reduce actions that generates analysis T_k. Then we
can define P(d|L) (= P(T|L)) as

∀d ∈ D:  P(d|L) = ∏_{j=1}^{|d|} P(a_{d,j} | k_{d,j}) ,    (5)

where |d| is the number of parser actions in d, a_{d,j}
is the jth parser action in d, and k_{d,j} is the context
of the parser while executing a_{d,j}.
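To make equation (5) concrete, the following minimal sketch (our own illustration in SWI-Prolog; p_action/3, the trained and smoothed action model, is a hypothetical interface) scores one complete action sequence as the product of its action probabilities:

% d_score(+Actions, +Contexts, -P): P(d|L) as the product over all
% parser actions of P(a_{d,j} | k_{d,j}), cf. equation (5).
d_score([], [], 1.0).
d_score([A|As], [K|Ks], P) :-
    p_action(A, K, PA),      % smoothed action probability (section 3.3)
    d_score(As, Ks, Rest),
    P is PA * Rest.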
3.3. Choosing a context 
"C, ontext" ill equation (5) might be everything. It 
can be tile classical (CurrentParserState; 
LookAheadSymbol)-tuple, it may also contain 
iuformation about the following (look-ahead) 
word(s), elements on the parser stack or tile most 
probable dialogue act of tile utterance, even 
semantical iuformation about roles of the 
syntactical head of the phrase on the top of the 
parser stack. 
The training procedure of our probabilistic parser 
is straightforward: 
1. Construct complete parser action sequences
for each tree in the training set. Save, for
every action, all information about the whole
"context" we have chosen to use.
2. Count the occurrences of all actions in
different subcontexts. A subcontext may be
the whole context or a (possibly empty)
selection of features of the whole context.
Compute the probability of a parser action
given a subcontext as the relative frequency of
the action within this subcontext (a sketch of
this step follows below).
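A minimal sketch of the counting in step 2 (our own illustration in SWI-Prolog; seen/2 is a hypothetical event log, asserted in step 1, with one fact per observed (Subcontext, Action) pair):

% p_action_in(+Action, +Sub, -P): relative frequency of Action among
% all actions observed in subcontext Sub.
p_action_in(Action, Sub, P) :-
    aggregate_all(count, seen(Sub, Action), NA),
    aggregate_all(count, seen(Sub, _), NS),
    NS > 0,
    P is NA / NS.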
The reason why we build subcontexts is that
there is a relevant sparse-data problem in
Verbmobil. A treebank containing between 20,000
and 30,000 trees is too small to give reliable values
for larger contexts in a parsing table containing
500,000 entries or more. Hence we use the
smoothing technique known as backing-off
in statistical language modelling (Charniak, 1993)
and approximate the probability of an action a with
context k using its subcontexts c_i:

P(a|k) ≈ Σ_i α_i · P(a|c_i) ,    (6)

with the α_i summing to 1. The values for the α_i are
determined experimentally. We have chosen three
contexts for evaluation (K1 and K2 also exist in our
model but are irrelevant for this evaluation):
• K3: LR parser state and look-ahead symbol,
• K4: K3 plus the phrase head of the top element of the LR parsing stack,
• K5: K4 plus the look-ahead word.
Please see section 5.1. for the detailed results of this
evaluation.
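As an illustration of equation (6) with the contexts K3-K5, here is a minimal sketch (our own, in SWI-Prolog; the α values are placeholders, not the weights used in Verbmobil, and p_action_in/3 is the hypothetical relative-frequency estimate sketched above):

% Subcontexts of the full context K5 = (State, LookTag, Head, LookWord),
% each paired with its backing-off weight alpha_i. The weights must sum
% to 1; these numbers are placeholders (section 5.1 tunes the K5 weight).
subcontext(0.3, k3(S, T),       k5(S, T, _, _)).
subcontext(0.5, k4(S, T, H),    k5(S, T, H, _)).
subcontext(0.2, k5(S, T, H, W), k5(S, T, H, W)).

% p_action(+Action, +K5, -P): backed-off probability, equation (6).
p_action(Action, K5, P) :-
    findall(Alpha-Sub, subcontext(Alpha, Sub, K5), Pairs),
    foldl(weighted(Action), Pairs, 0.0, P).

weighted(Action, Alpha-Sub, P0, P) :-
    ( p_action_in(Action, Sub, Pi) -> true ; Pi = 0.0 ),   % unseen: 0
    P is P0 + Alpha * Pi.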
4 Transformation-based error correction 
Parsing spontaneous speech - even in a limited
domain - is a quite ambitious task for a context-free
grammar parser. We have a large set of non-
terminals in our grammar that, besides phrase
structure information, also encode functional
information like Head or Modifier and grammatical
information like accusative-complement or
verb-prefix. Our current grammars contain 240
non-terminals for German, 178 for English and 200
for Japanese, and the lexicon is derived
automatically from the treebank and external
resources (there were only minor efforts in
improving the lexicon manually).
During the development of the parser we
observed a constantly declining exact match rate of
the parser, from over 80% in the early stages (with
just a few hundred trees of training data) to under
50% today. The reason is that the first training
samples were simple utterances on "appointment
scheduling" only, while the treebank nowadays
contains spontaneous utterances from two domains,
and that there was a growing number of
inconsistencies in the treebank due to annotation
errors and a growing number of annotators. Hence
we had to develop a technique to improve the exact
match rate, particularly with regard to the subsequent
semantics construction process, which depends on
correct syntactic analyses to produce a correct
semantic representation of the utterance.
(Brill, 1993) applied transformation-based
learning methods to natural language processing,
especially to part-of-speech tagging. He showed
that it can be effective to let a system make a first
guess that is then improved or corrected by
subsequent transformation-based steps. We observed
many systematic errors in the output of the
probabilistic parser; hence we adopted this idea,
took the probabilistic shift-reduce parser as the
guesser, and tried to learn tree transformations from
our training data to improve this first guess. We
integrated the learned transformations into
Verbmobil as shown in figure 2.
The transformations map a tree to another tree,
changing parts that have been identified as incorrect
in the learning process. The output of the learning
process consists of simple Prolog clauses of the form
trans(+InputTree, -OutputTree) :- ... , !.

that are sorted by the number of matches on the
training corpus.

[Figure 2: Integration of the transformation learning into Verbmobil: offline, the probabilistic parser is run over the treebank utterances and transformation rules are learned from its output; online, these rules post-process the parser output before semantics construction and translation.]
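At runtime these clauses can be tried in order; the cut commits to the first rule whose left-hand side unifies with the input tree, and an identity clause can serve as fallback for trees without a matching rule. A minimal sketch (the rewrite rule shown is an invented example, not an actual learned transformation):

% Invented example rule: rewrite one specific (incorrect) tree shape.
trans(utt:[A, px:[auf], B], utt:[A, advx:[auf], B]) :- !.
% Identity fallback: trees with no matching rule pass through unchanged.
trans(Tree, Tree).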
4.1 The Problem 
The task of learning transformations that are 
suitable to post-process the output of a probabilistic 
parser can be implemented as shown in figure 2: 
1. train the probabilistic parser on a training set 
O (containing utterances and their human- 
annotated analyses). 
2. parse all utterances of O and save the
corresponding parser outputs P (a sketch of
this collection step follows the list).
3. find the set of as-general-as-possible 
transformations T that map all incorrect trees 
of P into corresponding correct trees in O and 
select the "optimal" transformation from this 
set. 
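As a minimal sketch of steps 1 and 2 (our own illustration; treebank/2 and parse/2 are hypothetical interfaces to the annotated corpus and the trained parser), the offline stage collects a Φ-Θ pair for every utterance whose parse deviates from its annotation:

% training_pairs(-Pairs): one Phi-Theta pair per utterance whose parser
% output Phi differs from its human-annotated tree Theta.
training_pairs(Pairs) :-
    findall(Phi-Theta,
            ( treebank(Utterance, Theta),   % hypothetical: gold analysis
              parse(Utterance, Phi),        % hypothetical: parser output
              Phi \= Theta ),
            Pairs).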
The first point has been described in section 3.3
and the second point is trivial. The as-general-as-
possible transformation is the mapping of a tree of
P into a tree for the same utterance in O that
achieves a high degree of generalization and fulfils
certain conditions, which are explained in section
4.2.
4.2. The Learning Algorithm

The learning algorithm used to derive the most
general tree transformations for the incorrect trees in
P is straightforward. To find the most general
transformation for a source tree Φ ∈ P to be mapped
into a destination tree Θ ∈ O, do:

1. find the set G of all common subtrees of Φ
and Θ.
2. find the set of all potential transformations.
A transformation t is formed by a substitution
θ_i that replaces one or more elements of G by
logical variables in Φ and Θ (i.e. t: θ_i(Φ) → θ_i(Θ)).
3. choose the "optimal" transformation from
this set.

Syntactic trees are represented as Prolog terms in
our learning process. Since the transformation
should be able to map large correct structures in Φ
to their (correct) counterparts in Θ, the first step of
the algorithm is performed by setting G equal to the
set of all (Prolog) subterms that are common to Φ
and Θ (i.e. G = subterms(Φ) ∩ subterms(Θ)), where
subtrees(+Tree, -SubTrees) could simply be defined
(in Prolog) as

subtrees(T, S) :- findall(X, subtree(X, T), S).
subtree(S, S).
subtree(S, _:L) :- member(M, L), subtree(S, M).

Trees are represented as terms like a:[b,c], for
example.

It is crucial here to attach a unique identifier to
each word (like "1-hi", "2-Mr.", "3-Smith") because
one word (like the article "the") can occur several
times in one sentence, and it is important to keep
those occurrences separate for the second step of
the learning algorithm.

The second step computes all potential tree
transformations by substituting one or more
elements of G in Φ and Θ by identical (Prolog)
variables. In this regard, "substitution" is an
operation that is inverse to the substitution known
from predicate logic.
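For illustration (our own example, using the word-identifier convention just described), the subtrees/2 definition above enumerates all subterms of a small tree:

?- subtrees(utt:['1-hi', px:['2-auf']], S).
S = [utt:['1-hi', px:['2-auf']], '1-hi', px:['2-auf'], '2-auf'].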
Choosing the "optimal" transformation from the
space of all transformations in the third step is a
multi-dimensional problem. The dimensions are:

• fault tolerance
• coverage of the training corpus
• degree of generalization
Fault tolerance is a parameter that indicates how
many correction errors on the training corpus the
human supervisor is willing to tolerate, i.e. how
many of the correct parser trees may be transformed
into incorrect ones. Accepting transformation errors
may improve the degree of generalization of the
transformations, but for Verbmobil we decided not to
be fault tolerant: in our view, a correct analysis
should be kept correct.
Coverage of the training corpus means that if
step 2 of the learning algorithm has found several
possible transformations for a Φ-Θ pair, the
transformation that covers the most examples
in P/O should be preferred, because this
transformation is likely to apply more often in the
running system or in a test situation.
Besides this heuristic generalization criterion of
training-corpus coverage we also introduced
a formal one. If there are several transformations
that do not generate errors on the training corpus
and have exactly the same maximum coverage, we
select the transformation with the smallest
mean distance of its logical variables to the root of
the tree, because we expect the most general
transformation to have its variable parts "near the
root" of the trees. Distance is measured in levels
from the root. For example, the transformation in
figure 3 has a mean root distance of its variables
of ((1 + 2) + (1 + 3)) / 4 = 1.75.
[Figure 3: An example transformation between two utt trees containing the word "auf" under a px node; its logical variables (A, B, ...) lie at depths 1 and 2 on one side and 1 and 3 on the other.]
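A minimal sketch of this criterion (our own illustration; it collects the depths of the logical variables of one tree side, root at level 0; for a transformation, the variables of both sides are pooled):

% var_depths(+Tree, +Level, -Depths): levels of all logical variables.
var_depths(T, D, [D]) :- var(T), !.
var_depths(_:Children, D, Ds) :- !,
    D1 is D + 1,
    child_depths(Children, D1, Ds).
var_depths(_, _, []).                    % atomic leaf: no variables

child_depths([], _, []).
child_depths([C|Cs], D, Ds) :-
    var_depths(C, D, Ds1),
    child_depths(Cs, D, Ds2),
    append(Ds1, Ds2, Ds).

% mean_root_distance(+Tree, -Mean): average variable depth; pooling
% both sides of the transformation in figure 3 gives
% ((1+2)+(1+3))/4 = 1.75.
mean_root_distance(T, Mean) :-
    var_depths(T, 0, Ds),
    length(Ds, N), N > 0,
    sum_list(Ds, Sum),
    Mean is Sum / N.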
Using this learning algorithm we generate a set
of optimal transformations for many of the errors
the parser produces on the set of training utterances.
There remain some utterances for which no valid
transformation can be found, because all potential
transformations would generate errors on the
training corpus, which we are not willing to accept.
5 Evaluation results 
At the time this paper is written we have performed
several experiments on different aspects of our
work, some of which are published here.
5.1. Experiments on context sensitivity 
The question behind this experiment was: "We have
developed a probabilistic parsing model using more
context information. Does it generate any benefit?"
To answer this question we trained the parser on
19,750 German trees and tested it on 1,000 (unseen)
utterances with contexts of different sizes (the
contexts K3, K4 and K5 are explained in section
3.3). As shown in figure 4 (the x-axis is the weight
that controls the influence of the large context in the
backing-off process), the labeled precision of the K5
parser is always better than that of the parsers using
less context. The labeled recall of the K5 parser is
superior as long as the large context is not
overweighted. Higher weights increase a kind
of "memory effect", so that the trained model does
not generalize well on (unseen) test data. The
optimal K5 weight lies between 0.1 and 0.2, as can
be seen in figure 4.
5.2. Evaluation of the probabilistic parser
We evaluated the parser on German, English and
Japanese Verbmobil data. The results of this
evaluation are given in the following table:
                                      German    English   Japanese
Training set [trees]                  19,750    17,793    3,218
Test set [utterances]                 1,000     1,000     300
Exact match                           46.3%     55.4%     67.7%
Incorrect parses                      50.3%     39.3%     21.3%
Not parsed                            3.4%      5.3%      11.0%
Context-free rules                    988       2,205     932
Labeled precision                     90.2%     90.6%     84.9%
Labeled recall (all utterances)       83.5%     78.5%     63.1%
Labeled recall (parsed utterances)    91.0%     90.9%     86.3%
It is quite interesting that despite the low exact
match rate our parser achieves high precision/recall
values on parsed utterances. The reason is that we
have - for the sake of the semantics construction
process - a large number of non-terminal symbols
in our context-free grammars, and the parser often
chooses
[Figure 4: Labeled precision and recall (in %, y-axis 76-92) of the K3, K4 and K5 models plotted against the backing-off weight of the large context (x-axis 0.1-0.6).]
only one or two slightly incorrect symbols per 
parse. The mean parsing time per utterance was 
about 400 ms for German and English and about
30 ms for Japanese on a 166-MHz Sun Ultra-1
workstation.
5.3. Influence of transformation-based error 
correction 
It is important to have a very high exact match rate
for the semantics construction process. As shown in
the table of section 5.2, the exact match rates are
quite low; thus we learned transformations
from the training data to improve the output of the
German and English parsers (there was not enough
training data to do so for Japanese) and evaluated
the results shown in the following table (TT is an
abbreviation for Tree Transformations).
As shown in this table, the tree transformations
improve the exact match rate by 16% for German
and 10% for English in relative terms (e.g.
(53.8 - 46.3) / 46.3 ≈ 16%).
                                              German    English
Exact match (w/o TT)                          46.3%     55.4%
Incorrect parses (w/o TT)                     50.3%     39.3%
Not parsed                                    3.4%      5.3%
Exact match (after TT)                        53.8%     61.2%
Incorrect parses (after TT)                   42.8%     33.5%
Labeled precision (w/o TT)                    90.2%     90.6%
Labeled precision (after TT)                  90.8%     91.4%
Labeled recall (all utterances, w/o TT)       83.5%     78.5%
Labeled recall (all utterances, after TT)     84.0%     79.2%
Labeled recall (parsed utterances, w/o TT)    91.0%     90.9%
Labeled recall (parsed utterances, after TT)  91.6%     91.7%
6 Conclusion 
In this article we have extended probabilistic shift-
reduce parsing to be more context-sensitive than
previous work, and we have demonstrated that a
bigger context improves the performance of a
probabilistic shift-reduce parser. It was shown that
our model is suitable for parsing utterances of the
Verbmobil domains in three different languages. It
was also shown that the exact match rate of a
probabilistic parser can be improved significantly
using a symbolic transformation-based post-
processing step.
Our method of learning tree transformations has
produced promising first results, but it is based on
the mapping of whole trees to whole trees. A
direction for further research could be to extend this
process to learning transformations on smaller
(sub-)structures like single phrases. That should
improve generalization and help to improve the
exact match rate on the difficult domain of parsing
spontaneously spoken utterances.
Acknowledgements 
This research was supported by the German Federal
Ministry for Education, Science, Research and
Technology under grant no. 01IV701A3. I would
like to thank all Verbmobil colleagues, especially
the colleagues at IMS Stuttgart and the University of
Tübingen, who supported this work through their
cooperation. I would also like to thank the
anonymous reviewers for their valuable comments.

References 

Abney, S. P. Parsing by Chunks. In: Berwick, R. C., Abney, S. P., Tenny, C. (eds.) Principle-Based Parsing: Computation and Psycholinguistics. Kluwer Academic Publishers, 1991.

Brill, E. A Corpus-Based Approach To Language Learning. PhD Thesis, Department of Computer and Information Science, University of Pennsylvania, 1993.

Briscoe, T., Carroll, J. Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars. In: Computational Linguistics, Vol. 19, No. 1, 1993.

Briscoe, T., Carroll, J. Apportioning Development Effort in a Probabilistic LR-Parsing System through Evaluation. In: Proceedings of the ACL SIGDAT Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, pp. 92-100, May 1996.

Charniak, E. Statistical Language Learning. MIT Press, Cambridge, Mass., 1993.

Inui, K., Sornlertlamvanich, V., Tanaka, H., Tokunaga, T. A New Formalization of Probabilistic GLR Parsing. In: Proceedings of the International Workshop on Parsing Technologies, 1997.

Kiefer, B., Krieger, H.-U., Carroll, J., Malouf, R. A Bag of Useful Techniques for Efficient and Robust Parsing. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL-99, pp. 473-480, 1999.

Kompe, R., Batliner, A., Block, H.-U., Kießling, A., Niemann, H., Nöth, E., Ruland, T., Schachtl, S. Improving Parsing of Spontaneous Speech with the Help of Prosodic Boundaries. In: Proceedings of the ICASSP, pp. 75-78, München, 1997.

Lavie, A. GLR*: A Robust Grammar-Focused Parser for Spontaneously Spoken Language. PhD Thesis, Carnegie Mellon University, Pittsburgh, 1996.

Ruland, T., Rupp, C. J., Spilker, J., Weber, H., Worm, K. Making the Most of Multiplicity: A Multi-Parser Multi-Strategy Architecture for the Robust Processing of Spoken Language. In: Proceedings of the ICSLP, Sydney, 1998.

Schmid, L. Parsing Word Graphs Using a Linguistic Grammar and a Statistical Language Model. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '94), Adelaide, 1994.

Wahlster, W. Translation of Face-to-Face Dialogs. In: Proceedings of MT Summit IV, Kobe, Japan, pp. 127-135, July 1993.

Wright, J. H., Wrigley, E. N. GLR Parsing with Probability. In: Tomita, M. (ed.) Generalised LR Parsing. Kluwer Academic Publishers, Boston, 1991.
