Automatic Acquisition of Phrase Grammars for Stochastic 
Language Modeling 
Giuseppe Riccardi and Srinivas Bangalore 
AT&T Labs - Research 
180 Park Avenue 
Florham Park, NJ 09732 
{dsp3, srini}@research.att.com
Abstract 
Phrase-based language models have been recognized 
to have an advantage over word-based language 
models since they allow us to capture long span- 
ning dependencies. Class based language models 
have been used to improve model generalization and 
overcome problems with data sparseness. In this pa- 
per, we present a novel approach for combining the 
phrase acquisition with class construction process 
to automatically acquire phrase-grammar fragments 
from a given corpus. The phrase-grammar learning 
is decomposed into two sub-problems, namely the 
phrase acquisition and feature selection. The phrase 
acquisition is based on entropy minimization and the 
feature selection is driven by the entropy reduction 
principle. We further demonstrate that the phrase- 
grammar based n-gram language model significantly 
outperforms a phrase-based n-gram language model 
in an end-to-end evaluation of a spoken language 
application. 
1 Introduction 
Traditionally, n-gram language models implicitly assume words as the basic lexical unit. However, certain word sequences (phrases) recur in constrained-domain languages and can be thought of as a single lexical entry (e.g. by and large, I would like to, United States of America). A traditional word n-gram based language model can benefit greatly from variable-length units that capture long-spanning dependencies, for any given order n of the model. Furthermore, language modeling based on longer units is applicable to languages which do not have a predefined notion of a word. However, the problem of data sparseness is more acute in phrase-based language models than in word-based language models. Clustering words into classes has been used to overcome data sparseness in word-based language models (Brown et al., 1992; Kneser and Ney, 1993; Pereira et al., 1993; McCandless and Glass, 1993; Bellegarda et al., 1996; Saul and Pereira, 1997). Although the automatically acquired phrases can be later clustered into classes to overcome data sparseness, we present a novel approach
that combines the construction of classes with the acquisition of phrases. This integration of phrase acquisition and class construction results in the acquisition of phrase-grammar fragments. In (Gorin, 1996; Arai et al., 1997), grammar fragment acquisition is performed through Kullback-Leibler divergence techniques with application to topic classification from text.
Although phrase-grammar fragments reduce the 
problem of data sparseness, they can result in over- 
generalization. For example, one of the classes 
induced in our experiments was C1 = {and, but, 
because} which one might call the class of conjunc- 
tions. However, this class was part of a phrase- 
grammar fragment such as A T C1 T which results 
in phrases A T and T, A T but T, A T because 
T - a clear case of over-generalization given our cor- 
pus. Hence we need to further stochastically sepa- 
rate phrases generated by a phrase-grammar frag- 
ment. 
In this paper, we present our approach to integrat- 
ing phrase acquisition and clustering and our tech- 
nique to specialize the acquired phrase fragments. 
We extensively evaluate the effectiveness of phrase- 
grammar based n-gram language model and demon- 
strate that it outperforms a phrase-based n-gram 
language model in an end-to-end evaluation of a spo- 
ken language application. The outline of the paper is 
as follows. In Section 2, we review the phrase acqui- 
sition algorithm presented in (Riccardi et al., 1997). 
In Section 3, we discuss our approach to clustering words and phrases. The algorithm integrating the phrase acquisition and clustering processes is presented in Section 4. The spoken language application for automatic call routing
(How May I Help You? (HMIHY)) that is used for 
evaluating our approach and the results of our ex- 
periments are described in Section 5. 
2 Learning Phrases 
In previous work, we have shown the effectiveness of incorporating manually selected phrases for reducing the test set perplexity (see footnote 1) and the word error rate of a large vocabulary recognizer (Riccardi et al., 1995; Riccardi et al., 1996). However, a critical issue for the design of a language model based on phrases is the algorithm that automatically chooses the units by optimizing a suitable cost function. For improving the prediction of word probabilities, the criterion we used is the minimization of the language perplexity PP(T) on a training corpus T. This algorithm for extracting phrases from a training corpus is similar in spirit to (Giachin, 1995), but differs in the language model components and optimization parameters (Riccardi et al., 1997). In addition, we extensively evaluate the effectiveness of phrase n-gram (n > 2) language models by means of an end-to-end evaluation of a spoken language system (see Section 5). The phrase acquisition method is a greedy algorithm that performs local optimization based on an iterative process which converges to a local minimum of PP(T). As depicted in Figure 1, the algorithm consists of three main parts:
• Generation and ranking of a set of candidate 
phrases. This step is repeated at each iteration 
to constrain the search for all possible symbol 
sequences observed in the training corpus. 
• Each candidate phrase is evaluated in terms of 
the training set perplexity. 
• At the end of the iteration, the set of selected
phrases is used to filter the training corpus and
replace each occurrence of the phrase with a
new lexical unit. The filtered training corpus
will be referenced as $T_f$.
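The filtering step in the last bullet can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `filter_corpus`, the ":" joiner for merged units (borrowed from the notation of Table 2) and the left-to-right, non-overlapping matching are all assumptions.

```python
def filter_corpus(sentences, phrases):
    """Replace each occurrence of a selected unit pair with one merged unit.

    sentences: list of token lists.
    phrases:   set of (x, y) unit pairs selected in the current iteration.
    Matching is left to right and non-overlapping (an assumption).
    """
    filtered = []
    for tokens in sentences:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
                out.append(tokens[i] + ":" + tokens[i + 1])  # new lexical unit
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        filtered.append(out)
    return filtered


corpus = [["i", "would", "like", "to", "make", "a", "collect", "call"]]
# First iteration: two unit pairs are merged.
once = filter_corpus(corpus, {("like", "to"), ("collect", "call")})
# Second iteration: a merged unit can itself be part of a longer phrase.
twice = filter_corpus(once, {("would", "like:to")})
```

Because merged units re-enter the corpus as ordinary symbols, repeated filtering is what lets pair ranking grow phrases longer than two words.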
In the first step of the procedure, a set of candidate phrases (unit pairs; see footnote 2) is drawn out of a training corpus T and ranked according to a correlation coefficient. The most used measure for the interdependence of two events is the mutual information $MI(x, y) = \log \frac{P(x, y)}{P(x) P(y)}$. However, in this experiment, we use a correlation coefficient that has provided the best convergence speed for the optimization procedure:

$$\rho_{x,y} = \frac{P(x, y)}{P(x) + P(y)} \qquad (1)$$

where $P(x)$ is the probability of symbol $x$. The coefficient $\rho_{x,y}$ ($0 \le \rho_{x,y} \le 0.5$) is easily extended to define $\rho_{x_1, x_2, \ldots, x_n}$ for the n-tuple $(x_1, x_2, \ldots, x_n)$ ($0 \le \rho_{x_1, \ldots, x_n} \le 1/n$). Phrases $(x, y)$ with high $\rho_{x,y}$ or $MI(x, y)$ are such that $P(x, y) \simeq P(x) = P(y)$. In the case of $P(x, y) = P(x) = P(y)$, $\rho_{x,y} = 0.5$ while $MI = -\log P(x)$. Namely, the ranking by MI is biased towards low-probability events, which are not likely to be selected by our Maximum Likelihood algorithm. In fact, the phrase $(x, y)$ will be selected only if $P(x, y) \simeq P(x) \simeq P(y)$ and the training set perplexity is decreased when $(x, y)$ is treated as a single unit.

[Figure 1: Algorithm for phrase selection. Blocks: training set filtering; generate and rank candidate phrases; select phrases by perplexity minimization.]

[Figure 2: Training set perplexity vs number of selected phrases using $\rho$ (solid line) and MI (dashed line).]

Footnote 1: The perplexity PP(T) of a corpus T is $PP(T) = \exp(-\frac{1}{n} \log P(T))$, where n is the number of words in T.

Footnote 2: We ranked symbol pairs and increased the phrase length by successive iteration. An additional speed-up to the algorithm could be gained by ranking symbol k-tuples (k > 2) at each iteration.

In Figure 2 we show the behavior of the training set perplexity (learning curve) when incorporating an increasing number of selected phrases, using $\rho_{x,y}$ and $MI(x, y)$ as ranking coefficients. In particular, after evaluating 1000 phrases and selecting 300 of those, the perplexity decrease is 20% and 4% using $\rho_{x,y}$ and $MI(x, y)$ respectively.
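The different behavior of the two ranking coefficients can be checked numerically. The sketch below evaluates the coefficient of equation 1, P(x,y) / (P(x) + P(y)), and the mutual information for two perfectly correlated pairs, one frequent and one rare. The function names and the probability values are invented for illustration only.

```python
import math


def rho(p_xy, p_x, p_y):
    """Correlation coefficient of equation 1."""
    return p_xy / (p_x + p_y)


def mutual_information(p_xy, p_x, p_y):
    """Pointwise mutual information log(P(x,y) / (P(x) P(y)))."""
    return math.log(p_xy / (p_x * p_y))


# Two pairs with P(x, y) = P(x) = P(y): the symbols always occur together.
frequent = (0.1, 0.1, 0.1)
rare = (0.001, 0.001, 0.001)

rho_frequent, rho_rare = rho(*frequent), rho(*rare)
mi_frequent, mi_rare = mutual_information(*frequent), mutual_information(*rare)
```

Both pairs reach the maximum value 0.5 of the coefficient in equation 1, whereas the mutual information equals -log P(x) and therefore ranks the rare pair well above the frequent one.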
Each of the candidate phrases $(x, y)$ is treated as a single unit in order to build a stochastic model $\lambda$ of k-th (k > 2) order based on the filtered training corpus, $T_f$ (see footnote 3). Then, $(x, y)$ is selected by the algorithm if $PP_{\lambda}(T) < PP(T)$. At the end of each iteration the set $\{(x, y)\}$ is selected and employed to filter the training corpus. The algorithm iterates until the perplexity decrease saturates or a specified number of phrases have been selected (see footnote 4).
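The selection test can be written down directly from the perplexity definition exp(-(1/n) log P(T)), where n is the number of words in T. The sketch below assumes that the log-probability of the training corpus under each model is already available; how the stochastic model is trained is outside its scope. The function names and the numbers are hypothetical.

```python
import math


def perplexity(log_prob, n_words):
    """PP(T) = exp(-(1/n) log P(T)), with n the number of words in T."""
    return math.exp(-log_prob / n_words)


def accept_phrase(log_prob_with_phrase, log_prob_baseline, n_words):
    """Select the candidate only if it lowers the training set perplexity."""
    return perplexity(log_prob_with_phrase, n_words) < perplexity(log_prob_baseline, n_words)


n_words = 1000
baseline = -3000.0     # invented value of log P(T) under the current model
with_phrase = -2900.0  # invented value of log P(T) with (x, y) as a single unit

pp_baseline = perplexity(baseline, n_words)
pp_candidate = perplexity(with_phrase, n_words)
selected = accept_phrase(with_phrase, baseline, n_words)
```

Keeping n fixed at the word count of T makes perplexities comparable across iterations even though merging phrases reduces the number of units in the filtered corpus.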
The second issue in building a phrase-based language model is the training of large vocabulary stochastic finite state machines. In (Riccardi et al., 1996) we present a unified framework for learning stochastic finite state machines (Variable Ngram Stochastic Automata, VNSA) from a given corpus T for large vocabulary tasks. The stochastic finite state machine learning algorithm in (Riccardi et al., 1995) is designed in such a way that it can recognize any possible sequence of basic units while
• minimizing the number of parameters (states
and transitions).
• computing state transition probabilities based
on word, phrase and class n-grams by implementing
different back-off strategies.
For the word sequence $W = w_1, w_2, \ldots, w_N$, a standard word n-gram model provides the following probability decomposition:

$$P(W) = \prod_i P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \qquad (2)$$

The phrase n-gram model maps $W$ into a bracketed sequence such as $[w_1]_{f_1}, [w_2, w_3]_{f_2}, \ldots, [w_{N-2}, w_{N-1}, w_N]_{f_M}$. Then, the probability $P(W)$ can be computed as:

$$P(W) = \prod_i P(f_i \mid f_{i-n+1}, \ldots, f_{i-1}) \qquad (3)$$

By comparing equations 2 and 3 it is evident how the phrase n-gram model allows for an increased right and left context in computing $P(W)$.
In order to evaluate the test perplexity perfor- 
mance of our phrase-based VNSA, we have split the 
How May I Help You? data collection into an 8K 
and 1K training and test set, respectively. In Fig- 
ure 3, the test set perplexity is measured versus the 
VNSA orders for word and phrase language mod- 
els. It is worth noticing that the largest perplex- 
ity decrease comes from using phrase bigram when 
compared against word bigram. Furthermore, the 
perplexity of the phrase models is always lower than 
the corresponding word models. 
Footnote 3: At the first iteration $T_f \equiv T$.
Footnote 4: The language model estimation component of the algorithm guards against the problem of overfitting (Riccardi et al., 1996).
[Figure 3: Test set perplexity vs VNSA language model order. Legend: 2 = word bigram; 2-phr = phrase bigram; 3 = word trigram; 3-phr = phrase trigram.]
3 Clustering Phrases 
In the context of language modeling, clustering has typically been used on words to induce classes that are then used to predict smoothed probabilities of occurrence for rare or unseen events in the training corpus. Most clustering schemes (Brown et al., 1992; Kneser and Ney, 1993; Pereira et al., 1993; McCandless and Glass, 1993; Bellegarda et al., 1996; Saul and Pereira, 1997) use the average entropy reduction to decide when two words fall into the same cluster. In contrast, our approach to clustering words is similar to Schutze (1992). The words to be clustered are each represented as a feature vector, and similarity between two words is measured in terms of the distance between their feature vectors. Using these distances, words are clustered to produce a hierarchy. The hierarchy is then cut at a certain depth to produce clusters, which are then ranked by a goodness metric. This method assigns each word to a unique class, thus producing hard clusters.
3.1 Vector Representation 
A set of 50 high frequency words from the given 
corpus are designated as the "context words". The 
idea is that the high frequency words will mostly be 
function words which serve as good discriminators 
for content words (certain content words appear only 
with certain function words). 
Each word is associated with a feature vector 
whose components are as follows: 
1. Left context: The co-occurrence frequency of each of the context words appearing in a window of 3 words to the left of the current word is computed. This determines the distribution of the context words to the left of the current word within a window of 3 words.
2. Right context: Similarly, the distribution of the context words appearing within a window of 3 words to the right of the current word is computed.
This leaves us with adjacent words sharing a lot of the surrounding context, and hence they might end up in the same class (see footnote 5). To prevent this situation from happening, we include two additional sets of features for the immediate left and immediate right contexts. Adjacent words then will have different immediate context profiles.
3. Immediate Left context: The distribution of the context words appearing to the immediate left of the current word.
4. Immediate Right context: The distribution of the context words appearing to the immediate right of the current word.

Class   Compactness  Class Members
Index   Value
C363    0.131        make place
C118    0.180        eight eighty five four nine oh one seven six three two zero
C357    0.190        bill charge
C260    0.216        an and because but so when
C300    0.233        KOok
C301    0.236        from please
C277    0.241        again here
C202    0.252        is it's
C204    0.263        different third
C77     0.268        number numbers
C275    0.272        need needed want wanted
C256    0.274        assistance directory information
C197    0.278        all before happened
C68     0.278        ninety sixty
C41     0.290        his our the their
C199    0.291        called dialed got have
C27     0.296        as by in no not now of or something that that's there whatever working
C327    0.296        I I'm I've
C48     0.299        canada england france germany israel italy japan london mexico paris
C69     0.308        back direct out through
C143    0.312        connected going it
C89     0.314        arizona california carolina florida georgia illinois island jersey
                     maryland michigan missouri ohio pennsylvania virginia west york
C23     0.323        be either go see somebody them
C90     0.332        about me off some up you
Table 1: The results of clustering words from the How May I Help You? corpus
Thus each word of the vocabulary is represented 
by a 200 component vector. The frequencies of the 
components of the vector are normalized by the fre- 
quency of the word itself. 
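The construction of the feature vector can be sketched as below. With 50 context words the four blocks give the 200 components mentioned above; the toy example uses 3 context words and therefore 12 components. The function name, the ordering of the four blocks, the handling of sentence boundaries and the toy sentences are assumptions made for illustration.

```python
def feature_vector(word, sentences, context_words, window=3):
    """Return the 4 * len(context_words) component vector for `word`.

    Block layout (an assumed ordering):
    [left window | right window | immediate left | immediate right].
    Counts are normalized by the frequency of `word`.
    """
    index = {c: k for k, c in enumerate(context_words)}
    m = len(context_words)
    vec = [0.0] * (4 * m)
    freq = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            freq += 1
            # Context words within `window` positions to the left.
            for j in range(max(0, i - window), i):
                if tokens[j] in index:
                    vec[index[tokens[j]]] += 1
            # Context words within `window` positions to the right.
            for j in range(i + 1, min(len(tokens), i + 1 + window)):
                if tokens[j] in index:
                    vec[m + index[tokens[j]]] += 1
            # Immediate left and immediate right neighbors.
            if i > 0 and tokens[i - 1] in index:
                vec[2 * m + index[tokens[i - 1]]] += 1
            if i + 1 < len(tokens) and tokens[i + 1] in index:
                vec[3 * m + index[tokens[i + 1]]] += 1
    return [v / freq for v in vec] if freq else vec


sentences = [
    ["i", "want", "to", "make", "a", "call"],
    ["i", "need", "to", "place", "a", "call"],
]
context_words = ["i", "to", "a"]  # stand-in for the 50 high frequency words
v_make = feature_vector("make", sentences, context_words)
v_place = feature_vector("place", sentences, context_words)
```

In this toy corpus "make" and "place" occur in identical contexts, so their vectors coincide, which is the kind of evidence that places them in the same class.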
The Left and Right features are intended to capture the effects of wider range contexts, thus collapsing contexts that differ only due to modifiers, while the Immediate Left and Right features are intended to capture the effects of local contexts. By including both sets of features, the effects of the local contexts are weighted more than the effects of the wider range contexts, a desirable property. The same result might be obtained by weighting the contributions of individual context positions differently, with the closest position weighted most heavily.

Footnote 5: It is unlikely that with fine grained classes, two words belonging to the same class will follow each other.
3.2 Distance Computation and 
Hierarchical clustering 
Having set up a feature vector for each word, the 
similarity between two words is measured using the 
Manhattan distance metric between their feature 
vectors. Manhattan distance is based on the sum of 
the absolute value of the differences among the coor- 
dinates. This metric is much less sensitive to outliers 
than the Euclidean metric. We experimented with 
other distance metrics such as Euclidean and maxi- 
mum, but Manhattan gave us the best results. 
Having computed the distance matrix, the words are hierarchically clustered with a compact linkage (see footnote 6), in which the distance between two clusters is the largest distance between a point in one cluster and a point in the other cluster (Jain and Dubes, 1988). A hierarchical clustering method was chosen since we expected to use the hierarchy as a back-off model. Also, since we do not know a priori the number of clusters we want, we did not use clustering schemes such as the k-means clustering method, where we would have to specify the number of clusters from the start.
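A small, unoptimized sketch of this procedure is given below: Manhattan distance between feature vectors, and agglomerative merging in which the distance between two clusters is the largest pairwise distance between their members. The function names, the merge-history format and the two-dimensional toy vectors are assumptions, not the code or data used in the experiments.

```python
def manhattan(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def compact_linkage(vectors):
    """Agglomerative clustering with the largest-distance linkage.

    vectors: dict mapping an item name to its feature vector.
    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [(name,) for name in sorted(vectors)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance: largest distance between any two members.
                d = max(manhattan(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


toy_vectors = {
    "one": [0.0, 1.0],
    "two": [0.1, 1.0],
    "make": [1.0, 0.0],
    "place": [1.0, 0.2],
}
merges = compact_linkage(toy_vectors)
```

On the toy data the first two merges pair "one" with "two" and "make" with "place" at small distances, and the final merge joins the two pairs at a much larger distance.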
Footnote 6: We tried other linkage strategies such as average linkage and connected linkage, but compact linkage gave the best results.
Class   Compactness  Class Members
Index   Value
D365    0.226        wrong:C77 second
D325    0.232        C256:C256 C256
D380    0.239        area:code:C118:C118:C118:C118:C118 C68
D386    0.243        a:C77 this:C77
D382    0.276        C260:C357:C143:to:another C260:C357:C143:to:my:home
D288    0.281        C327:C275:to:C363 I'd:like:to:C363 to:C363 yes:I'd:like:to:C363
D186    0.288        good:morning yes:ma'am yes:operator hello hi ma'am may well
D148    0.315        problems trouble
D87     0.315        A:T:C260:T C260:C327 C27:C27 C41:C77 C118 C143 C260
                     C197 C199 C202 C23 C260 C27 C277
                     C301 C69 C77 C90 operator to
D183    0.321        C118:C118:hundred C204 telephone
D143    0.326        new:C89 C48 C89 colorado massachusetts tennessee texas
D387    0.327        my:home my:home:phone
D4      0.336        my:calling my:calling:card my:card
D70     0.338        C199:a:wrong:C77 misdialed
D383    0.341        like:to:C363 trying:to:C363 would:like:to:C363
D381    0.347        like:to:C363:a:collect:call:to like:to:C363:collect:call
                     would:like:to:C363:a:collect:call
                     would:like:to:C363:a:collect:call:to
D159    0.347        C118:C118 C118:C118:C118
                     C118:C118:C118:C118:C118:C118
                     C118:C118:C118:C118:C118:C118:C118
                     C118:C118:C118:C118:C118:C118:C118:C118
                     C118:C118:C118:C118:C118:C118:C118:C118:C118:C118
                     C118:C118:C118:C118:C118:C118:C118:C118:C118:C118:C118
                     area:code:C118:C118:C118 C300
Table 2: The results of the first iteration of combining phrase acquisition and clustering from the HMIHY
corpus. (Words in a phrase are separated by a ":". The members of the Ci's are shown in Table 1.)
3.2.1 Choosing the number of clusters 
One of the trickiest issues in clustering is the choice of the number of clusters after the clustering is complete. Instead of fixing the number of clusters in advance, we use the median of the distances between clusters merged at the successive stages as the cutoff and prune the hierarchy at the point where the cluster distance exceeds the cutoff value. Clusters are defined by the structure of the tree above the cutoff point. (Note that the cluster distance increases as we climb up the hierarchy.)
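The median cutoff can be illustrated on a small merge history. The sketch below reads the rule as "keep every merge whose distance does not exceed the median merge distance"; this reading, the function name and the history format (members of the two merged clusters plus their distance, listed in merge order) are assumptions.

```python
from statistics import median


def cut_hierarchy(items, history):
    """Prune a merge history at the median merge distance.

    items:   names of the clustered items.
    history: (members_a, members_b, distance) tuples in merge order,
             with distances that do not decrease along the list.
    Returns the cutoff and the clusters left after pruning.
    """
    cutoff = median(d for _, _, d in history)
    cluster_of = {item: (item,) for item in items}
    for members_a, members_b, d in history:
        if d > cutoff:
            break  # the cluster distance exceeds the cutoff: stop merging
        merged = members_a + members_b
        for item in merged:
            cluster_of[item] = merged
    return cutoff, sorted(set(cluster_of.values()))


# Invented merge history for four items.
toy_history = [
    (("one",), ("two",), 0.1),
    (("make",), ("place",), 0.2),
    (("one", "two"), ("make", "place"), 2.0),
]
cutoff, clusters = cut_hierarchy(["one", "two", "make", "place"], toy_history)
```

Here the median of the three merge distances is 0.2, so the two tight merges are kept and the final, loose merge is pruned.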
3.2.2 Ranking the clusters 
Once the clusters are formed, the goodness of each cluster is measured by its compactness value. The compactness value of a cluster is simply the average distance of the members of the cluster from the centroid of the cluster. The components of the centroid vector are computed as the component-wise average of the vector representations of the members of the cluster.
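The compactness value can be computed as in the following sketch. The text does not state which distance is used between a member and the centroid; the Manhattan distance of Section 3.2 is assumed here, and the function names and toy vectors are illustrative.

```python
def manhattan(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise average of the member vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compactness(vectors, distance=manhattan):
    """Average distance of the cluster members from the cluster centroid."""
    center = centroid(vectors)
    return sum(distance(v, center) for v in vectors) / len(vectors)


tight_cluster = [[0.0, 1.0], [0.2, 1.0]]
loose_cluster = [[0.0, 1.0], [1.0, 0.0]]
tight_value = compactness(tight_cluster)
loose_value = compactness(loose_cluster)
```

A lower value indicates a tighter cluster, which is why Tables 1 and 2 list the classes in order of increasing compactness value.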
The method described above is general in that it can be used to cluster either words or phrases. Table 1 illustrates the result of clustering words and Table 2 illustrates the result of clustering phrases for the training data from our application domain. For example, the first iteration of the algorithm clusters words, and the result is shown in Table 1. Each word in the corpus is replaced by its class label. If the word is not a member of any class then it is left unchanged. This transformed corpus is input to the phrase acquisition process. Figure 4 shows interesting and long phrases that are formed after the phrase acquisition process. Table 2 shows the result of subsequent clustering of the phrase-annotated corpus.
1. like:to:C363:a:collect:call:to 
2. like:to:C363:collect:call 
3. would:like:to:C363:a:collect:call 
4. would:like:to:C363:a:collect:call:to 
Figure 4: Sample phrases that include the class
label C363 = {make place}. The components of a
phrase are separated by a ":".
4 Learning Phrase Grammar 
In the previous sections we have shown algorithms for acquiring (see Section 2) and clustering (see Section 3) phrases. While it is straightforward to pipeline the phrase acquisition and clustering algorithms, in the context of learning phrase-grammars they are not separable. Thus, we cannot first learn phrases and then cluster them or vice versa. For example, in order to cluster together the phrase cut off and disconnected, we first have to learn the phrase cut off. On the other hand, in order to learn the phrase area code for <city> we first have to learn the cluster <city>, containing city names (e.g. Boston, New York).
Learning phrase grammars can be thought of as an iterative process that is composed of two language acquisition strategies. The goal is to search for those features f, sequences of terminal and non-terminal symbols, that provide the highest learning rate (the entropy reduction within a learning interval, first strategy) and minimize the language entropy (second strategy, same as in Section 2).
[Figure 5: Algorithm for phrase-grammar acquisition. Blocks: phrase clustering; phrase grammar learning by perplexity minimization.]
Initially, the set of features f drawn from a corpus T contains terminal symbols $V_0$ only. New features can be generated by either
1. grouping (conjunction operator) an existing set
of symbols, $V_i$, into phrases
or
2. mapping an existing set of symbols $V_i$ into a new set
of symbols $V_{i+1}$ (disjunction operator) through
the categorization provided by the clustering algorithm.
The whole symbol space is then given by $V = \bigcup_i V_i$ as shown in Figure 6, and the problem of learning the best feature set is then decomposed into two sub-problems: to find the optimal subset of $V$ (first learning strategy) that gives us the best features (second learning strategy) generated by a given set $V_i$.
[Figure 6: The sequence of symbol sets $V_i$ generated by successive clustering steps.]
In order to combine the two optimization problems, we have integrated them into a greedy algorithm as shown in Figure 5. In each algorithm iteration we might first cluster the current set of phrases and extract a set of non-terminal symbols and then acquire the phrases (containing terminal and non-terminal symbols) in order to minimize the language entropy. We use the clustering step of our algorithm to control the steepness of the learning curve within a subset $V_i$ of the whole feature space. In fact, by varying the clustering rate (number of times clustering is performed for an entire acquisition experiment) we optimize the reduction of the language entropy for each feature selection (entropy reduction principle). Thus, the search for the optimal subset of $V$ is designed so as to maximize the entropy reduction $\Delta H(\mathbf{f})$ over a set of features $\mathbf{f} = \{f_1, f_2, \ldots, f_m\}$ in $V_i$:

$$\max_{\mathbf{f}} \Delta H(\mathbf{f}) = \max_{\mathbf{f}} \left[ \hat{H}_{\lambda_0}(T) - \hat{H}_{\lambda_{\mathbf{f}}}(T) \right] \qquad (4)$$

$$= \max_{\mathbf{f}} \frac{1}{n} \log \frac{P_{\lambda_{\mathbf{f}}}(T)}{P_{\lambda_0}(T)} \qquad (5)$$

where $\hat{H}_{\lambda_{\mathbf{f}}}(T)$ is the entropy of the corpus T based on the phrase n-gram model $\lambda_{\mathbf{f}}$ and $\lambda_0$ is the initial model; equation 5 follows from equation 4 in the sense of the law of large numbers. The search space over all possible features f in equation 4 is built upon the notion of phrase ranking according to the $\rho$ measure (see Section 2) and phrase clustering rate. By varying these two parameters we can search for the best learning strategies following the greedy algorithm given in Section 2. In Figure 7, we give an example of slow and quick learning, defined by the rate of entropy reduction within an interval. The discontinuities in the learning curves correspond to the clustering algorithm step. The maximization in equation 5 is carried out for each interval between the entropy discontinuities. Therefore, the quick learning strategy provides the best learning curve in the sense of equation 5.
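The entropy reduction criterion can be illustrated with the per-word estimate -(1/n) log P(T), which is the logarithm of the perplexity defined in footnote 1. Under that estimate, the difference between the entropy of the initial model and the entropy of a candidate model equals (1/n) times the log-likelihood ratio of the two models. The sketch below is only a numerical illustration: the function names, the strategy labels and the log-probabilities are invented.

```python
def entropy(log_prob, n_words):
    """Per-word entropy estimate -(1/n) log P(T), in nats."""
    return -log_prob / n_words


def entropy_reduction(log_prob_initial, log_prob_candidate, n_words):
    """Entropy of the initial model minus entropy of the candidate model."""
    return entropy(log_prob_initial, n_words) - entropy(log_prob_candidate, n_words)


n_words = 1000
log_p_initial = -3000.0
# Invented log-probabilities of T after one learning interval.
candidates = {"slow": -2950.0, "quick": -2900.0}

reductions = {name: entropy_reduction(log_p_initial, lp, n_words)
              for name, lp in candidates.items()}
best = max(reductions, key=reductions.get)
# The same quantity written as a normalized log-likelihood ratio.
ratio_form = (candidates[best] - log_p_initial) / n_words
```

The strategy with the largest reduction within the interval is the one the maximization would retain.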
[Figure 7: Examples of slow and quick phrase-grammar learning.]
4.1 Training Language Models for Large 
Vocabulary Systems 
Phrase-grammars allow for an increased generalization, since they can generate phrases that may never have been observed in the training corpus but are similar to the ones that have been observed. This generalization property is also used for smoothing the word probabilities in the context of stochastic language modeling for speech recognition and understanding. Standard class-based models smooth the word n-gram probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ in the following way:

$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = P(C_i \mid C_{i-n+1}, \ldots, C_{i-1}) \, P(w_i \mid C_i) \qquad (6)$$

where $P(C_i \mid C_{i-n+1}, \ldots, C_{i-1})$ is the class n-gram probability and $P(w_i \mid C_i)$ is the class membership probability. However, phrases recognized by the same phrase-grammar can actually occur within different syntactic contexts but their similarity is based on their most likely lexical context. In (Riccardi et al., 1996) we have developed a context dependent training algorithm of the phrase class probabilities. In particular,

$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = P(C_i \mid C_{i-n+1}, \ldots, C_{i-1}; S) \, P(w_i \mid C_i; S) \qquad (7)$$

where S is the state of the language model assigned by the VNSA model (Riccardi et al., 1996). In particular, $S = S(w_{i-n+1}, \ldots, w_{i-1}; \lambda_{\mathbf{f}})$ is determined by the word history $w_{i-n+1}, \ldots, w_{i-1}$ and the phrase-grammar model $\lambda_{\mathbf{f}}$. For example, our algorithm has acquired the conjunction cluster {but, and, because} that leads to generating phrases like A T and T or A T because T, the latter clearly an erroneous generalization given our corpus. However, training context dependent probabilities as shown in Equation 7 delivers a stochastic separation between the correct and incorrect phrases:

$$\log \frac{P(\text{A T and T})}{P(\text{A T but T})} = 5.7 \qquad (8)$$
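The standard class-based smoothing, in which the word probability is the class n-gram probability times the class membership probability, can be illustrated for the bigram case. The sketch below covers only this context-independent form, not the state-dependent training of (Riccardi et al., 1996). The class labels follow the conjunction example above, but the function name and every probability value are invented for illustration.

```python
def class_bigram_prob(word, prev_word, word_class, class_bigram, membership):
    """P(w_i | w_{i-1}) = P(C_i | C_{i-1}) * P(w_i | C_i) for hard classes."""
    c, c_prev = word_class[word], word_class[prev_word]
    return class_bigram[(c_prev, c)] * membership[(word, c)]


# Hard class assignment: each word belongs to exactly one class.
word_class = {"and": "C1", "but": "C1", "because": "C1", "T": "T"}
# Invented class bigram and class membership probabilities.
class_bigram = {("T", "C1"): 0.2}
membership = {("and", "C1"): 0.7, ("but", "C1"): 0.2, ("because", "C1"): 0.1}

p_and = class_bigram_prob("and", "T", word_class, class_bigram, membership)
p_because = class_bigram_prob("because", "T", word_class, class_bigram, membership)
```

Every member of the class receives some probability after T, including sequences never observed in training, which is the source of both the generalization and the over-generalization discussed above.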
Given a set of phrases containing terminal and non-terminal symbols, the goal of large vocabulary stochastic language modeling for speech recognition and understanding is to assign a probability to all terminal symbol sequences. One of the main motivations for learning phrase-grammars is to decrease the local uncertainty in decoding spontaneous speech by embedding tightly constrained structure in the large vocabulary automaton. The language models trained on the acquired phrase-grammars give a slight improvement in perplexity (average measure of uncertainty). Another figure of merit in evaluating a stochastic language model is its local entropy ($-\sum_i P(s_i \mid s) \log P(s_i \mid s)$), which is related to the notion of the branching factor of a language model state s. In Figure 8 we plot the local entropy histograms for word, phrase and phrase-grammar bigram stochastic models. The word bigram distribution reflects the sparseness of the word pair constraints. The phrase-grammar based language model delivers a local entropy distribution skewed in the range [0-1] because of the tight constraints enforced by the phrase-grammars.
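The local entropy of a state is the entropy of its outgoing transition distribution. The sketch below computes it for a few invented distributions; the function name is hypothetical, and the use of base-2 logarithms (bits) is an assumption, since the base is not stated in the text.

```python
import math


def local_entropy(transition_probs):
    """Entropy of a state's outgoing transition distribution, in bits."""
    return -sum(p * math.log2(p) for p in transition_probs if p > 0)


deterministic = local_entropy([1.0])        # a single successor
binary = local_entropy([0.5, 0.5])          # two equally likely successors
four_way = local_entropy([0.25] * 4)        # four equally likely successors
skewed = local_entropy([0.9, 0.05, 0.05])   # one dominant successor
```

A state with one dominant successor has a local entropy well below that of a state with several equally likely successors, which is how tightly constrained phrase-grammar states shift the histogram towards small values.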
[Figure 8: Local entropy histograms for word, phrase and phrase-grammar bigram VNSAs.]
5 Spoken Language Application 
We have applied the algorithms for phrase-grammar acquisition to the How May I Help You? (Gorin et al., 1997) speech understanding task. We briefly review the problem and the spoken language system. The goal is to understand callers' responses to the open-ended prompt How May I Help You? and route such a call based on the meaning of the response. Thus we aim at extracting a relatively small number of semantic actions from the utterances of a very large set of users who are not trained to the system's capabilities and limitations.
The first utterance of each transaction has been 
transcribed and marked with a call-type by label- 
ers. There are 14 call-types plus a class other for 
the complement class. In particular, we focused our 
study on the classification of the caller's first utter- 
ance in these dialogs. The spoken sentences vary 
widely in duration, with a distribution distinctively 
skewed around a mean value of 5.3 seconds corre- 
sponding to 19 words per utterance. Some examples 
of the first utterances are given below: 
• Yes ma'am where is area code two zero
one?
• I'm tryn'a call and I can't get it to
go through I wondered if you could try
it for me please?
• Hello 
In the training set there are 3.6K words which define the lexicon. The out-of-vocabulary (OOV) rate at the token level is 1.6%, yielding a sentence-level OOV rate of 30%. Significantly, only 50 out of the 100 lowest rank singletons were cities and names while the others were regular words like authorized, realized, etc.
For call type classification from speech we designed a large vocabulary one-step speech recognizer utilizing the phrase-grammar stochastic model (Section 4) that achieved 60% word accuracy. Then, we categorized the decoded speech input into call-types, using the salient fragment classifier developed in (Gorin, 1996; Gorin et al., 1997). The salient phrases have the property of modeling local constraints of the language while carrying most of the semantic interpretation of the whole utterance. A block diagram of the speech understanding system is given in Figure 10. In an automated call router there are two important performance measures. The first is the probability of false rejection, where a call is falsely rejected or classified as other. Since such calls would be transferred to a human agent, this corresponds to a missed opportunity for automation. The second measure is the probability of correct classification. Errors in this dimension lead to misinterpretations that must be resolved by a dialog manager (Abella and Gorin, 1997). In Figure 9, we plot the probability of correct classification versus the probability of false rejection, for different speech recognition language models and the same classifier (Gorin et al., 1997). The curves are generated by varying a salience threshold (Gorin, 1996). In a dialog
system, it would be useful even if the correct call-type was one of the top 2 choices of the decision rule (Abella and Gorin, 1997). Thus, in Figure 9 the classification scores are shown for the first and second ranked call-types identified by the understanding algorithm. The phrase-grammar trigram model is compared to the baseline system, which is based on the phrase-based stochastic finite state machines described in (Gorin et al., 1997). The phrase-grammar model outperforms the baseline phrase-based model, and it achieves a 22% classification error rate reduction. The second set of curves (Text) in Figure 9 gives an upper bound on the performance from speech experiments. It is worth noting that the rank 2 performance of the phrase-grammar model is aligned with the rank 1 classification performance on the true transcriptions (dashed lines).
[Figure 9: Rank 1 and 2 classification rate plot from text and speech with phrase-grammar trigram. Panel title: ROC curves for 1K test set. Horizontal axis: false rejection rate (%).]
[Figure 10: Block diagram of the speech understanding system.]
6 Conclusions 
In this paper, we have presented a novel approach that combines phrase acquisition and class construction to automatically acquire grammar fragments for language modeling. The phrase-grammar learning is decomposed into two sub-problems, namely the phrase acquisition and feature selection. The phrase acquisition is based on entropy minimization and the feature selection is driven by the entropy reduction principle. This integration results in the learning of stochastic phrase-grammar fragments, which are then trained on the corpus at hand in a context-dependent manner. We also demonstrated that a phrase-grammar based language model significantly outperformed a phrase-based language model in an end-to-end evaluation of a spoken language application.

References 
A. Abella and A. L. Gorin. 1997. Generating se- 
mantically consistent inputs to a dialog manager. 
In Proc. EUROSPEECH, European Conference 
on Speech Communication and Technology, pages 
1879-1882, September. 
K. Arai, J. H. Wright, G. Riccardi, and A. L. Gorin.
1997. Grammar fragment acquisition using syntactic
and semantic clustering. In Proc. Workshop
Spoken Language Understanding and Communication,
December.
J. Bellegarda, J. Butzberger, Y. Chow, N. Coccaro, 
and D. Naik. 1996. A novel word clustering algo- 
rithm based on latent semantic analysis. In Pro- 
ceedings of ICASSP-96, pages 172-175. 
Brown et al. 1992. Class-based n-gram models of
natural language. In Computational Linguistics,
volume 18(4), pages 467-479.
E. Giachin. 1995. Phrase bigram for continuous
speech recognition. In Proc. IEEE Int. Conf.
Acoust., Speech, Signal Proc., pages 225-228,
May.
A. L. Gorin, G. Riccardi, and J. H. Wright. 1997. 
How May I Help You? In Speech Communication, 
volume 23, pages 113-127. 
A. L. Gorin. 1996. Processing of semantic informa- 
tion in fluently spoken language. In Proc. of Intl. 
Conf. on Spoken Language Processing, October. 
A.K. Jain and R.C. Dubes. 1988. Algorithms for 
Clustering Data. Englewood Cliffs, NJ: Prentice- 
Hall. 
Reinhard Kneser and Hermann Ney. 1993. Improved
clustering techniques for class-based statistical
language modelling. In Eurospeech.
Michael K. McCandless and James R. Glass. 1993.
Empirical acquisition of word and phrase classes
in the ATIS domain. In Third European Conference
on Speech Communication and Technology,
Berlin, Germany, September.
Fernando C. N. Pereira, Naftali Z. Tishby, and Lillian
Lee. 1993. Distributional clustering of English
words. In 30th Annual Meeting of the Association
for Computational Linguistics, pages 183-190,
Columbus, Ohio.
G. Riccardi, E. Bocchieri, and R. Pieraccini. 1995. 
Non deterministic stochastic language models for 
speech recognition. In Proc. IEEE Int. Conf. 
Acoust., Speech, Signal Proc., May. 
G. Riccardi, R. Pieraccini, and E. Bocchieri. 1996. 
Stochastic automata for language modeling. In 
Computer Speech and Language, pages 265-293, 
December. 
G. Riccardi, A. L. Gorin, A. Ljolje, and M. Riley. 
1997. A spoken language understanding for au- 
tomated call routing. In Proc. IEEE Int. Conf. 
Acoust., Speech, Signal Proc., pages 1143-1146, 
April. 
L. Saul and F. Pereira. 1997. Aggregate and mixed- 
order markov models for statistical language pro- 
cessing. In Proceedings of the 2nd Conference on 
Empirical Methods in Natural Language Process- 
ing. 
Hinrich Schutze. 1992. Dimensions of meaning. In 
Proceedings of Supercomputing, Minneapolis MN., 
pages 787-796. 
