A Simple Statistical Class Grammar for Measuring Speech Recognition 
Performance 
Alan Derr 
Richard Schwartz 
BBN Systems and Technologies Corporation 
Cambridge, MA 02138 
ABSTRACT 
In this paper we will discuss our development of a new 
grammar that is to be used for evaluation of speech 
recognition systems. The grammar is a statistical
first-order class grammar and has been developed for
two different task domains (the DARPA 1000-word Re- 
source Management domain and a 2000-word personnel 
database domain). We will first motivate the develop- 
ment of this grammar, next describe the grammar and 
its development, and finally present results and conclu- 
sions. 
1 MOTIVATION 
In recent DARPA community-wide speech recognition
system evaluations, the recognition systems have been
tested using two grammatical conditions: no grammar 
(or null grammar), and the word-pair grammar. These 
grammars suffer from several inadequacies. 
The null grammar simply forces the recognition sys- 
tem to partition the input speech into whole-word units 
without using any knowledge of the language to place 
restrictions on the possible sequences of words that are 
allowed. As a result, the "no grammar" test condition 
provides only a worst case recognition test point for the 
evaluation of recognition systems.
The word-pair grammar, on the other hand, was de- 
rived from the sentence patterns that were used to gener- 
ate the 2800 sentences in the Resource Management cor- 
pus. Only pairs of words that could occur in a sentence 
generated by the patterns were allowed in the word-pair
grammar. On the average, each word in the vocabulary 
can be followed by about 60 words. No probabilities 
are assigned to the different words (just 0 or 1). As
a result, the recognition rate is artificially high, since 
many reasonable word sequences are disallowed. At the 
same time, if a real sentence has one of these disallowed
word-pairs, it could not be recognized correctly. 
As a result of the unrealistic restrictions imposed by 
the word-pair grammar, the recognition performance of 
systems using this grammar is too high to allow reliable 
measurement of system improvements without resorting 
to the use of very large evaluation test sets. Creating new
sentences for the test set becomes a problem, since there 
is a danger that new sentences that are within the task 
domain may not be parsed (allowed) by the word-pair
grammar. 
We desired a grammar that would overcome the de- 
ficiencies of the null and word-pair grammars, while at 
the same time providing several additional benefits. We 
wanted the new grammar to capture statistics that are 
representative of the real data in the task domain while, 
at the same time, providing full coverage (i.e., allowing 
all sentences that are possible within the task domain to 
be parsed by the grammar). We also wanted the gram- 
mar to be "tunable" to some degree, by allowing its 
perplexity to be adjusted. Increasing the grammar's
perplexity will allow us to simulate a recognition system's
performance with a more difficult (e.g., larger) task do- 
main. For this reason we used several approximations in 
the method for estimating the grammar that caused the 
grammar to have higher perplexity. Finally, we wanted
a grammar that would allow us to change task domains 
with relative ease. 
2 DESCRIPTION 
The grammar that we developed is a statistical first-order 
class grammar in which the probability of a word (W1)
being followed by another word (W2) is given by: 
$$P(W_2 \mid W_1) = \sum_{\text{paths}} P(C_1 \mid W_1)\, P(C_2 \mid C_1)\, P(W_2 \mid C_2)$$
where C1 is each of the classes to which W1 belongs,
and C2 is each of the classes to which W2 belongs.
Since each of W1 and W2 may belong to multiple classes,
the summation is over all possible paths
from W1 to W2. This is represented graphically below:
[Figure: paths from word node W1 through class nodes C1 and C2 to word node W2. Legend: • = word node; ○ = class node with silence loop.]
Note that, in the diagram, the silence at the beginning 
of a sentence ("start silence") and the silence at the end 
of a sentence ("end silence") are simply special cases of 
W1 and W2, respectively, where each is in a separate
class. The "class node w/silence loop" indicates that a
silence may be inserted between each word. 
In our work to date, we have made two simplifying 
assumptions. The conditional probability P(C1|W1) is
approximated by: 
$$P(C_1 \mid W_1) = \begin{cases} N_{W_1 \in C_1}^{-1} & \text{for } W_1 \text{ in } C_1 \\ 0 & \text{otherwise} \end{cases}$$
where N_{W1∈C1} is the number of classes of which
word W1 is a member. (For example, if a word is a
member of two classes, P(C1|W1) will be 0.5 for each
of those classes and 0.0 for all other classes.) A similar
approximation is made for P(W2|C2), where:
$$P(W_2 \mid C_2) = \begin{cases} N_{W_2 \in C_2}^{-1} & \text{for } W_2 \text{ in } C_2 \\ 0 & \text{otherwise} \end{cases}$$
where N_{W2∈C2} is the number of words in class C2.
The probabilities P(C1|W1) and P(W2|C2) are fixed
and not changed during the training of the grammar.
With this simplification, the only term that must be
estimated during the training of the grammar is P(C2|C1),
the class-to-class transition probability.
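The formula and the two uniform approximations above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `word_bigram_prob`, the dictionary layout, the second ship name, and the transition probabilities in the toy data are all hypothetical. Only the class assignments of "DISPLAY" and "SEAWOLF" follow the examples given in the paper.

```python
from collections import defaultdict


def word_bigram_prob(w1, w2, word_classes, class_trans):
    """P(W2|W1) = sum over class paths of P(C1|W1) * P(C2|C1) * P(W2|C2)."""
    # Count the words in each class; needed for the uniform P(W2|C2).
    class_sizes = defaultdict(int)
    for classes in word_classes.values():
        for c in classes:
            class_sizes[c] += 1

    classes_w1 = word_classes[w1]
    p_c1_given_w1 = 1.0 / len(classes_w1)  # uniform over the classes of W1
    total = 0.0
    for c1 in classes_w1:
        for c2 in word_classes[w2]:
            p_w2_given_c2 = 1.0 / class_sizes[c2]  # uniform over words in C2
            p_c2_given_c1 = class_trans.get((c1, c2), 0.0)
            total += p_c1_given_w1 * p_c2_given_c1 * p_w2_given_c2
    return total


# Toy data: "OTHER-SHIP" and all transition values are made up.
word_classes = {
    "DISPLAY": ["command-verb", "adjective", "noun"],
    "SEAWOLF": ["ship-name"],
    "OTHER-SHIP": ["ship-name"],
}
class_trans = {
    ("command-verb", "ship-name"): 0.3,
    ("noun", "ship-name"): 0.06,
}

p = word_bigram_prob("DISPLAY", "SEAWOLF", word_classes, class_trans)
```

With these toy values, "DISPLAY" contributes 1/3 per class, the ship-name class contains two words, and the three paths contribute 0.05, 0, and 0.01, for a total of about 0.06.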
3 GRAMMAR TRAINING 
To train the grammar, we began by assigning class(es) 
to each word in the vocabulary. A word may be as- 
signed to multiple classes. For example, the word "SEA- 
WOLF" is assigned to one class: ship-name. On the 
other hand, the word "DISPLAY" is assigned to three 
classes: command-verb, adjective, and noun. Once the
words were assigned to appropriate classes, the statistics of
the grammar were counted directly from the training data 
by counting the number of transitions from each class to 
each other class. These counts were then padded slightly 
(to account for unobserved class-to-class transitions) to
allow the grammar to parse sentences containing unob- 
served class transitions. Finally, the grammar was tested 
on a test set to measure its perplexity. 
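The training procedure described above can be sketched as follows. The paper does not say how a transition is counted when a word belongs to several classes, nor how the counts are padded, so two assumptions are made here and marked in the code: each observed word pair spreads one count evenly over its class paths, and padding is a small additive constant (`pad`, a hypothetical parameter). The function name and toy data are also hypothetical, and start and end silence classes are left out for brevity.

```python
from collections import defaultdict
from itertools import product


def train_class_transitions(sentences, word_classes, pad=0.1):
    """Estimate P(C2|C1) from padded class-to-class transition counts."""
    all_classes = sorted({c for cs in word_classes.values() for c in cs})
    counts = defaultdict(float)
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            paths = list(product(word_classes[w1], word_classes[w2]))
            # Assumption: an ambiguous word pair spreads its single
            # count evenly over all of its class paths.
            for c1, c2 in paths:
                counts[(c1, c2)] += 1.0 / len(paths)

    probs = {}
    for c1 in all_classes:
        # Assumption: additive padding keeps unobserved transitions possible.
        row_total = sum(counts[(c1, c2)] + pad for c2 in all_classes)
        for c2 in all_classes:
            probs[(c1, c2)] = (counts[(c1, c2)] + pad) / row_total
    return probs


# Toy data; "SHOW" and its class are made up for illustration.
word_classes = {
    "SHOW": ["command-verb"],
    "DISPLAY": ["command-verb", "adjective", "noun"],
    "SEAWOLF": ["ship-name"],
}
sentences = ["SHOW SEAWOLF", "DISPLAY SEAWOLF"]
probs = train_class_transitions(sentences, word_classes, pad=0.1)
```

Because every class pair receives the padding, no transition probability is zero, which is what gives the grammar full coverage.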
4 GRAMMAR PERFORMANCE 
Below is a summary of some of the characteristics of the
statistical first-order class grammar with 99 classes for
the DARPA 1000-word Resource Management task domain,
with null and word-pair grammar characteristics
given for comparison.
                          Perplexity      Recog.
Grammar       Coverage    Train   Test    error
Null          100%        992     992     12.6%
Word-pair      80%         60     NA       2.5%
Stat. class   100%         72      77      5.9%
Table 1: Grammar Performance Comparison 
The class grammar figures given here are based on a 
grammar trained using all 2800 sentences available for 
the 1000-word DARPA Resource Management task domain.
The training set perplexity is computed over the
entire training set and the test set perplexity is computed 
over the 300 sentences used for the May 1988 standard 
system evaluation. The word-pair coverage and perplex- 
ity are approximate theoretical figures assuming an inde- 
pendent test set. The test set perplexity for the word-pair 
is degenerate, since, if a single sentence doesn't parse, 
the perplexity becomes infinite. 
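The degenerate case can be seen in a short sketch. The paper does not define perplexity, so the code below assumes the standard definition: 2 raised to the negative average log2 probability per predicted word. The function name `perplexity`, the toy sentence, and both toy grammars are hypothetical, and sentence-boundary transitions are ignored for simplicity.

```python
import math


def perplexity(sentences, bigram_prob):
    """Return 2 ** (-(average log2 probability per predicted word))."""
    log_sum = 0.0
    n = 0
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            p = bigram_prob(w1, w2)
            if p == 0.0:
                # A single disallowed transition makes the perplexity infinite.
                return math.inf
            log_sum += math.log2(p)
            n += 1
    if n == 0:
        raise ValueError("no word transitions to score")
    return 2.0 ** (-log_sum / n)


test_set = ["SHOW THE SHIPS"]

# A uniform choice among 992 alternatives, as with no grammar at all.
null_pp = perplexity(test_set, lambda w1, w2: 1.0 / 992)

# A toy word-pair style grammar that never allows ("THE", "SHIPS").
allowed = {("SHOW", "THE")}
pair_pp = perplexity(test_set, lambda w1, w2: 0.5 if (w1, w2) in allowed else 0.0)
```

Here `null_pp` comes out at approximately 992, consistent with the null-grammar row of Table 1, while `pair_pp` is infinite because one transition in the test sentence has zero probability.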
In informal tests, we were able to "tune" the perplexity 
of the grammar by adjusting the number of classes into 
which the words are categorized. On a fixed test set of 
100 sentences and the full training set of 2800 sentences, 
the perplexity varied from 203 (with 50 classes) to 62 
(with 168 classes). 
We have obtained some preliminary results for a statistical
class grammar designed for a 2170-word personnel
database access task domain. The grammar uses
637 classes (1 to 5 per word) and is trained using 750
sentences. The perplexity of this grammar on an inde- 
pendent test set of 200 sentences was measured to be 
89.4. The perplexity measured on the training set was 
46.1. We haven't yet performed a full set of recognition 
experiments using this grammar. 
Acknowledgements 
The work reported here was supported by the Advanced 
Research Projects Agency and was monitored by the Office
of Naval Research under N00014-85-C-0279. The
views and conclusions contained in this document are 
those of the authors and should not be interpreted as 
necessarily representing the official policies, either ex- 
pressed or implied, of the Defense Advanced Research 
Projects Agency or the United States Government. 
5 CONCLUSIONS 
We have described the development of a statistical first- 
order class grammar. The structure of this grammar al- 
lows for relatively easy development of a new grammar 
for a new task domain. The grammar provides for full 
coverage of the task domain, even if not all possible class
sequences are observed in the data used to train
the grammar probabilities. It also provides a method for 
adjusting the perplexity of the grammar by varying the 
number of classes in the grammar. 
We recommend that this grammar be made
another standard grammar for the DARPA speech community.
We believe that this grammar could extend the
life of the Resource Management task by decreasing the 
recognition system's performance while still placing
restraints on the possible sequencing of words in a meaningful
way. Decreasing the recognition performance will
allow the (statistically significant) measurement of small 
system improvements without needing to increase the 
size of the evaluation test. We also recommend that 
this grammar replace the word-pair for official system 
evaluations. 
