File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1122_metho.xml
Size: 25,590 bytes
Last Modified: 2025-10-06 14:15:13
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1122"> <Title>Automatic Acquisition of Phrase Grammars for Stochastic Language Modeling</Title>
<Section position="3" start_page="188" end_page="189" type="metho"> <SectionTitle> 2 Learning Phrases </SectionTitle>
<Paragraph position="0"> In previous work, we have shown the effectiveness of incorporating manually selected phrases for reducing the test set perplexity and the word error rate of a large vocabulary recognizer (Riccardi et al., 1995; Riccardi et al., 1996). However, a critical issue for the design of a language model based on phrases is the algorithm that automatically chooses the units by optimizing a suitable cost function. For improving the prediction of word probabilities, the criterion we use is the minimization of the language perplexity PP(T) on a training corpus T. This algorithm for extracting phrases from a training corpus is similar in spirit to (Giachin, 1995), but differs in the language model components and optimization parameters (Riccardi et al., 1997). In addition, we extensively evaluate the effectiveness of phrase n-gram (n > 2) language models by means of an end-to-end evaluation of a spoken language system (see Section 5). The phrase acquisition method is a greedy algorithm that performs local optimization based on an iterative process which converges to a local minimum of PP(T). As depicted in Figure 1, the algorithm consists of three main parts: * Generation and ranking of a set of candidate phrases. This step is repeated at each iteration to constrain the search over all possible symbol sequences observed in the training corpus.</Paragraph>
<Paragraph position="1"> * Each candidate phrase is evaluated in terms of the training set perplexity.</Paragraph>
<Paragraph position="2"> * At the end of the iteration, the set of selected phrases is used to filter the training corpus and replace each occurrence of a phrase with a new lexical unit. The filtered training corpus will be referred to as T_f.</Paragraph>
<Paragraph position="3"> In the first step of the procedure, a set of candidate phrases (unit pairs) is drawn out of the training corpus T and ranked according to a correlation coefficient. The most widely used measure of the interdependence of two events is the mutual information MI(x,y) = \log \frac{P(x,y)}{P(x)P(y)}. However, in this experiment we use a correlation coefficient that has provided the best convergence speed for the optimization procedure:</Paragraph>
<Paragraph position="4"> \rho_{x,y} = \frac{P(x,y)}{P(x) + P(y)} \quad (1) where P(x) is the probability of symbol x. The coefficient ρ_{x,y} (0 ≤ ρ_{x,y} ≤ 0.5) is easily extended to define ρ_{x_1,x_2,...,x_n} for the n-tuple (x_1, x_2, ..., x_n) (0 ≤ ρ_{x_1,...,x_n} ≤ 1/n). Phrases (x, y) with high ρ_{x,y} or MI(x, y) are such that P(x, y) ≃ P(x) ≃ P(y). In the case of P(x, y) = P(x) = P(y), ρ_{x,y} = 0.5 while MI(x, y) = -log P(x). Namely, the ranking by MI is biased towards low probability events, which are not likely to be selected by our Maximum Likelihood algorithm. In fact, the phrase (x, y) will be selected only if P(x, y) ≃ P(x) ≃ P(y) and the training set perplexity is decreased when (x, y) is treated as a single unit.</Paragraph>
<Paragraph position="5"> Footnote: The perplexity PP(T) of a corpus T is PP(T) = \exp(-\frac{1}{n} \log P(T)), where n is the number of words in T.</Paragraph>
<Paragraph position="6"> Footnote: We ranked symbol pairs and increased the phrase length by successive iterations. An additional speed-up of the algorithm could be gained by ranking symbol k-tuples (k > 2) at each iteration.</Paragraph>
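<Paragraph> As a minimal sketch of this ranking step (Python; the whitespace-tokenized corpus, the helper name rank_candidate_pairs, and the choice of normalizing pair and unigram counts by the same token total are assumptions of this illustration, not the implementation used for the experiments), candidate unit pairs can be scored by the coefficient of Equation 1 and contrasted with mutual information as follows:

import math
from collections import Counter

def rank_candidate_pairs(corpus, top_k=1000):
    """Rank adjacent unit pairs by rho(x, y) = P(x, y) / (P(x) + P(y)).
    `corpus` is a list of tokenized sentences (lists of units)."""
    unigrams, bigrams = Counter(), Counter()
    n_tokens = 0
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        n_tokens += len(sent)

    def prob(count):
        return count / n_tokens

    scored = []
    for (x, y), c in bigrams.items():
        p_xy, p_x, p_y = prob(c), prob(unigrams[x]), prob(unigrams[y])
        rho = p_xy / (p_x + p_y)            # bounded by 0.5
        mi = math.log(p_xy / (p_x * p_y))   # mutual information
        scored.append(((x, y), rho, mi))
    # Ranking by rho favours frequent, strongly bound pairs, whereas ranking
    # by MI is biased towards rare events, as discussed above.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
</Paragraph>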
<Paragraph position="7"> [Figure 1: Block diagram of the phrase acquisition algorithm: generation and ranking of candidate phrases, perplexity evaluation, and training set filtering.] [Figure 2: Training set perplexity versus the number of selected phrases, using ρ (solid line) and MI (dashed line) as ranking coefficients.]</Paragraph>
<Paragraph position="8"> In Figure 2 we show the behavior of the training set perplexity (learning curve) as an increasing number of selected phrases is incorporated, using ρ_{x,y} and MI(x, y) as ranking coefficients. In particular, after evaluating 1000 phrases and selecting 300 of them, the perplexity decrease is 20% and 4% using ρ_{x,y} and MI(x, y), respectively.</Paragraph>
<Paragraph position="9"> Each of the candidate phrases (x, y) is treated as a single unit in order to build a stochastic model λ' of k-th order (k > 2) based on the filtered training corpus T_f. Then, (x, y) is selected by the algorithm if PP_{λ'}(T) < PP_{λ}(T), where λ is the model of the previous iteration. At the end of each iteration the set {(x, y)} of selected phrases is employed to filter the training corpus. The algorithm iterates until the perplexity decrease saturates or a specified number of phrases has been selected.</Paragraph>
<Paragraph position="10"> The second issue in building a phrase-based language model is the training of large vocabulary stochastic finite state machines. In (Riccardi et al., 1996) we present a unified framework for learning stochastic finite state machines (Variable Ngram Stochastic Automata, VNSA) from a given corpus T for large vocabulary tasks. The stochastic finite state machine learning algorithm in (Riccardi et al., 1995) is designed in such a way that it can recognize any possible sequence of basic units while * minimizing the number of parameters (states and transitions).</Paragraph>
<Paragraph position="11"> * computing state transition probabilities based on word, phrase and class n-grams by implementing different back-off strategies.</Paragraph>
<Paragraph position="12"> For the word sequence W = w_1, w_2, ..., w_N, a standard word n-gram model provides the following probability decomposition: P(W) = \prod_{i=1}^{N} P(w_i | w_{i-n+1}, ..., w_{i-1}) \quad (2)</Paragraph>
<Paragraph position="13"> The phrase n-gram model maps W into a bracketed sequence such as [w_1]_{f_1}, [w_2, w_3]_{f_2}, ..., [w_{N-2}, w_{N-1}, w_N]_{f_M}. Then, the probability P(W) can be computed over the phrase units f_j as: P(W) = \prod_{j=1}^{M} P(f_j | f_{j-n+1}, ..., f_{j-1}) \quad (3) By comparing Equations 2 and 3 it is evident how the phrase n-gram model allows for an increased right and left context in computing P(W).</Paragraph>
<Paragraph position="14"> In order to evaluate the test set perplexity performance of our phrase-based VNSA, we have split the How May I Help You? data collection into an 8K training set and a 1K test set. In Figure 3, the test set perplexity is measured versus the VNSA order for word and phrase language models. It is worth noticing that the largest perplexity decrease comes from using the phrase bigram when compared against the word bigram. Furthermore, the perplexity of the phrase models is always lower than that of the corresponding word models.</Paragraph>
<Paragraph position="15"> [Figure 3: Test set perplexity versus model order for word and phrase language models.]</Paragraph>
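<Paragraph> A sketch of the resulting greedy selection loop, reusing the rank_candidate_pairs sketch above (build_lm and perplexity stand in for the VNSA training and evaluation routines, which are not specified here; the ":" phrase separator follows the notation used later in Figure 4):

def filter_corpus(corpus, phrases):
    """Replace each occurrence of a selected pair (x, y) by the single
    unit "x:y", producing the filtered corpus T_f."""
    joined = {p: "{}:{}".format(*p) for p in phrases}
    filtered = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            pair = tuple(sent[i:i + 2])
            if pair in joined:
                out.append(joined[pair])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        filtered.append(out)
    return filtered

def acquire_phrases(corpus, build_lm, perplexity, n_iterations=5, n_candidates=1000):
    """Greedy acquisition: keep a candidate pair only if treating it as a
    single unit lowers the training set perplexity, then filter the corpus
    with all phrases accepted in the iteration."""
    best_ppl = perplexity(build_lm(corpus), corpus)
    phrases = []
    for _ in range(n_iterations):
        accepted = []
        for (x, y), _rho, _mi in rank_candidate_pairs(corpus, n_candidates):
            trial = filter_corpus(corpus, [(x, y)])
            ppl = perplexity(build_lm(trial), trial)
            if ppl < best_ppl:
                accepted.append((x, y))
                best_ppl = ppl
        if not accepted:
            break                              # perplexity decrease has saturated
        corpus = filter_corpus(corpus, accepted)
        phrases.extend(accepted)
    return phrases, corpus
</Paragraph>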
</Section>
<Section position="4" start_page="189" end_page="192" type="metho"> <SectionTitle> 3 Clustering Phrases </SectionTitle>
<Paragraph position="0"> In the context of language modeling, clustering has typically been used on words to induce classes that are then used to predict smoothed probabilities of occurrence for rare or unseen events in the training corpus. Most clustering schemes (Brown et al., 1992; Kneser and Ney, 1993; Pereira et al., 1993; McCandless and Glass, 1993; Bellegarda et al., 1996; Saul and Pereira, 1997) use the average entropy reduction to decide when two words fall into the same cluster.</Paragraph>
<Paragraph position="1"> In contrast, our approach to clustering words is similar to Schutze (1992). The words to be clustered are each represented as a feature vector, and the similarity between two words is measured in terms of the distance between their feature vectors. Using these distances, words are clustered to produce a hierarchy. The hierarchy is then cut at a certain depth to produce clusters, which are then ranked by a goodness metric. This method assigns each word to a unique class, thus producing hard clusters.</Paragraph>
<Section position="1" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 3.1 Vector Representation </SectionTitle>
<Paragraph position="0"> A set of 50 high frequency words from the given corpus are designated as the "context words". The idea is that the high frequency words will mostly be function words, which serve as good discriminators for content words (certain content words appear only with certain function words).</Paragraph>
<Paragraph position="1"> Each word is associated with a feature vector whose components are as follows: 1. Left context: The co-occurrence frequency of each of the context words appearing in a window of 3 words to the left of the current word is computed. This determines the distribution of the context words to the left of the current word within a window of 3 words.</Paragraph>
<Paragraph position="2"> 2. Right context: Similarly, the distribution of the context words appearing within a window of 3 words to the right of the current word is computed. This leaves us with adjacent words sharing a lot of the surrounding context, which hence might end up in the same class. To prevent this situation from happening, we include two additional sets of features for the immediate left and immediate right contexts. Adjacent words will then have different immediate context profiles.</Paragraph>
<Paragraph position="3"> 3. Immediate Left context: The distribution of the context words appearing to the immediate left of the current word.</Paragraph>
<Paragraph position="4"> 4. Immediate Right context: The distribution of the context words appearing to the immediate right of the current word.</Paragraph>
<Paragraph position="5"> Thus each word of the vocabulary is represented by a 200 component vector. The frequencies of the components of the vector are normalized by the frequency of the word itself.</Paragraph>
<Paragraph position="6"> Footnote: It is unlikely that with fine grained classes, two words belonging to the same class will follow each other.</Paragraph>
<Paragraph position="7"> [Table 1: Word classes C_i induced from the training corpus; the member words recovered from this extraction include: his, our, the, their, called, dialed, got, have, as, by, in, no, not, now, of, or, something, that, that's, there, whatever, working, I, I'm, I've, canada, england, france, germany, israel, italy, japan, london, mexico, paris, back, direct, out, through, connected, going, it, arizona, california, carolina, florida, georgia, illinois, island, jersey, maryland, michigan, missouri, ohio, pennsylvania, virginia, west, york, be, either, go, see, somebody, them, about, me, off, some, up, you (class boundaries not recoverable from the extraction).]</Paragraph>
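<Paragraph> The vector construction can be sketched as follows (a sketch only; the handling of sentence boundaries and the ordering of the four 50-dimensional blocks are assumptions of this illustration):

import numpy as np
from collections import Counter

def build_feature_vectors(corpus, n_context=50, window=3):
    """For each word, concatenate four profiles over the n_context most
    frequent words (left window, right window, immediate left, immediate
    right), normalized by the word's own frequency."""
    word_freq = Counter(w for sent in corpus for w in sent)
    context_words = [w for w, _ in word_freq.most_common(n_context)]
    idx = {w: i for i, w in enumerate(context_words)}

    vectors = {w: np.zeros(4 * n_context) for w in word_freq}
    for sent in corpus:
        for i, w in enumerate(sent):
            left = sent[max(0, i - window):i]
            right = sent[i + 1:i + 1 + window]
            for c in left:
                if c in idx:
                    vectors[w][idx[c]] += 1                      # left window
            for c in right:
                if c in idx:
                    vectors[w][n_context + idx[c]] += 1          # right window
            if left and left[-1] in idx:
                vectors[w][2 * n_context + idx[left[-1]]] += 1   # immediate left
            if right and right[0] in idx:
                vectors[w][3 * n_context + idx[right[0]]] += 1   # immediate right

    for w in vectors:
        vectors[w] = vectors[w] / word_freq[w]   # normalize by word frequency
    return vectors, context_words
</Paragraph>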
<Paragraph position="8"> The Left and Right features are intended to capture the effects of wider-range contexts, thus collapsing contexts that differ only due to modifiers, while the Immediate Left and Right features are intended to capture the effects of local contexts. By including both sets of features, the effects of the local contexts are weighted more than the effects of the wider-range contexts, a desirable property. The same result might be obtained by weighting the contributions of individual context positions differently, with the closest position weighted most heavily.</Paragraph>
</Section>
<Section position="2" start_page="189" end_page="192" type="sub_section"> <SectionTitle> 3.2 Distance Computation and Hierarchical Clustering </SectionTitle>
<Paragraph position="0"> Having set up a feature vector for each word, the similarity between two words is measured using the Manhattan distance metric between their feature vectors. Manhattan distance is based on the sum of the absolute values of the differences among the coordinates. This metric is much less sensitive to outliers than the Euclidean metric. We experimented with other distance metrics such as Euclidean and maximum, but Manhattan gave us the best results.</Paragraph>
<Paragraph position="1"> Having computed the distance matrix, the words are hierarchically clustered with compact linkage, in which the distance between two clusters is the largest distance between a point in one cluster and a point in the other cluster (Jain and Dubes, 1988). Footnote: We tried other linkage strategies such as average linkage and connected linkage, but compact linkage gave the best results.</Paragraph>
<Paragraph position="2"> A hierarchical clustering method was chosen since we expected to use the hierarchy as a back-off model. Also, since we do not know a priori the number of clusters we want, we did not use clustering schemes such as the k-means clustering method, where we would have to specify the number of clusters from the start.</Paragraph>
<Paragraph position="3"> [Figure 4: Phrases acquired from the class-labeled corpus. Words in a phrase are separated by a ":"; the members of the C_i's are shown in Table 1.]</Paragraph>
<Paragraph position="4"> 3.2.1 Choosing the number of clusters One of the trickiest issues in clustering is the choice of the number of clusters after the clustering is complete. Instead of fixing the number of clusters in advance, we use the median of the distances between clusters merged at the successive stages as the cutoff and prune the hierarchy at the point where the cluster distance exceeds the cutoff value. Clusters are defined by the structure of the tree above the cutoff point. (Note that the cluster distance increases as we climb up the hierarchy.)</Paragraph>
<Paragraph position="5"> Once the clusters are formed, the goodness of a cluster is measured by its compactness value. The compactness value of a cluster is simply the average distance of the members of the cluster from the centroid of the cluster. The components of the centroid vector are computed as the component-wise average of the vector representations of the members of the cluster.</Paragraph>
<Paragraph position="6"> The method described above is general in that it can be used to cluster either words or phrases.</Paragraph>
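<Paragraph> A compact realization of this procedure, sketched here with SciPy (the library choice, the tie to the build_feature_vectors sketch above, and the use of the Manhattan metric also inside the compactness computation are assumptions of this illustration):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_words(vectors):
    """Manhattan distances, compact (complete) linkage, cut at the median of
    the merge distances, and rank clusters by compactness (average distance
    of the members from the centroid)."""
    words = sorted(vectors)
    X = np.array([vectors[w] for w in words])
    dists = pdist(X, metric="cityblock")          # Manhattan distance
    tree = linkage(dists, method="complete")      # compact linkage
    cutoff = np.median(tree[:, 2])                # median merge distance
    labels = fcluster(tree, t=cutoff, criterion="distance")

    clusters = {}
    for w, lab in zip(words, labels):
        clusters.setdefault(lab, []).append(w)

    def compactness(members):
        M = np.array([vectors[w] for w in members])
        centroid = M.mean(axis=0)                 # component-wise average
        return np.abs(M - centroid).sum(axis=1).mean()

    # Smaller compactness value = tighter ("better") cluster.
    return sorted(clusters.values(), key=compactness)
</Paragraph>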
<Paragraph position="7"> Table 1 illustrates the result of clustering words and Table 2 illustrates the result of clustering phrases for the training data from our application domain. For example, the first iteration of the algorithm clusters words, and the result is shown in Table 1.</Paragraph>
<Paragraph position="8"> Each word in the corpus is replaced by its class label. If the word is not a member of any class, then it is left unchanged. This transformed corpus is input to the phrase acquisition process. Figure 4 shows interesting and long phrases that are formed after the phrase acquisition process. Table 2 shows the result of subsequent clustering of the phrase-annotated corpus.</Paragraph>
</Section> </Section>
<Section position="5" start_page="192" end_page="194" type="metho"> <SectionTitle> 4 Learning Phrase Grammar </SectionTitle>
<Paragraph position="0"> In the previous sections we have shown algorithms for acquiring (see Section 2) and clustering (see Section 3) phrases. While it is straightforward to pipeline the phrase acquisition and clustering algorithms, in the context of learning phrase-grammars they are not separable. Thus, we cannot first learn phrases and then cluster them, or vice versa. For example, in order to cluster together the phrase cut off and disconnected, we first have to learn the phrase cut off. On the other hand, in order to learn the phrase area code for <city>, we first have to learn the cluster <city> containing city names (e.g., Boston, New York, etc.).</Paragraph>
<Paragraph position="1"> Learning phrase grammars can be thought of as an iterative process that is composed of two language acquisition strategies. The goal is to search for those features f, sequences of terminal and non-terminal symbols, that provide the highest learning rate (the entropy reduction within a learning interval; first strategy) and minimize the language entropy (second strategy, the same as in Section 2). Initially, the set of features f drawn from a corpus T contains terminal symbols V_0 only. New features can be generated by either 1. grouping (conjunction operator) an existing set of symbols V_i into phrases, or 2. mapping an existing set of symbols V_i into a new set of symbols V_{i+1} (disjunction operator) through the categorization provided by the clustering algorithm. The whole symbol space is then given by V = U_i V_i, as shown in Figure 6, and the problem of learning the best feature set is decomposed into two subproblems: finding the optimal subset of V (first learning strategy), and finding the best features (second learning strategy) generated by a given set V_i obtained by successive clustering steps.</Paragraph>
<Paragraph position="2"> In order to combine the two optimization problems, we have integrated them into a greedy algorithm as shown in Figure 5. In each iteration of the algorithm we may first cluster the current set of phrases and extract a set of non-terminal symbols, and then acquire the phrases (containing terminal and non-terminal symbols) in order to minimize the language entropy. We use the clustering step of our algorithm to control the steepness of the learning curve within a subset V_i of the whole feature space. In fact, by varying the clustering rate (the number of times clustering is performed over an entire acquisition experiment) we optimize the reduction of the language entropy for each feature selection (entropy reduction principle).</Paragraph>
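<Paragraph> The interplay of the two operators can be sketched as follows, building on the acquire_phrases, build_feature_vectors and cluster_words sketches given earlier (build_lm and perplexity again stand in for the unspecified VNSA training and evaluation routines, and the class-label format is an arbitrary choice of this illustration):

def relabel_with_classes(corpus, classes):
    """Replace each word by a class label C_i if it belongs to a cluster;
    words outside every cluster are left unchanged."""
    label = {}
    for i, members in enumerate(classes):
        for w in members:
            label[w] = "C_{}".format(i)
    return [[label.get(w, w) for w in sent] for sent in corpus]

def learn_phrase_grammar(corpus, build_lm, perplexity, n_iterations=6, clustering_rate=2):
    """Alternate the clustering step (map symbols V_i into non-terminals
    V_{i+1}; disjunction) with the phrase acquisition step (group symbols
    into phrases; conjunction)."""
    grammar = {"classes": [], "phrases": []}
    for it in range(n_iterations):
        if it % clustering_rate == 0:      # the clustering rate controls the
            vectors, _ = build_feature_vectors(corpus)   # steepness of the curve
            classes = cluster_words(vectors)
            corpus = relabel_with_classes(corpus, classes)
            grammar["classes"].append(classes)
        phrases, corpus = acquire_phrases(corpus, build_lm, perplexity)
        grammar["phrases"].extend(phrases)
    return grammar, corpus
</Paragraph>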
<Paragraph position="3"> Thus, the search for the optimal subset of V is designed so as to maximize the entropy reduction ΔH(f) over the set of features f: \Delta H(f) = H_{\lambda_0} - H_{\lambda_f} \quad (4) \simeq H_{\lambda_0}(T) - H_{\lambda_f}(T) = \frac{1}{n} \log \frac{P_{\lambda_f}(T)}{P_{\lambda_0}(T)} \quad (5) where H_{\lambda_f}(T) is the entropy of the corpus T based on the phrase n-gram model λ_f, λ_0 is the initial model, and Equation 5 follows from Equation 4 in the sense of the law of large numbers. The search space over all possible features f in Equation 4 is built upon the notion of phrase ranking according to the ρ measure (see Section 2) and the phrase clustering rate. By varying these two parameters we can search for the best learning strategies following the greedy algorithm given in Section 2. In Figure 7, we give an example of slow and quick learning, defined by the rate of entropy reduction within an interval. The discontinuities in the learning curves correspond to the clustering algorithm step. The maximization in Equation 5 is carried out for each interval between the entropy discontinuities. Therefore, the quick learning strategy provides the best learning curve in the sense of Equation 5.</Paragraph>
<Paragraph position="4"> [Figure 7: Learning curves (entropy versus number of acquired features) for the slow and quick learning strategies; the discontinuities correspond to clustering steps.]</Paragraph>
<Section position="1" start_page="193" end_page="194" type="sub_section"> <SectionTitle> 4.1 Training Language Models for Large Vocabulary Systems </SectionTitle>
<Paragraph position="0"> Phrase-grammars allow for increased generalization, since they can generate phrases that may never have been observed in the training corpus but are similar to ones that have been observed. This generalization property is also used for smoothing the word probabilities in the context of stochastic language modeling for speech recognition and understanding. Standard class-based models smooth the word n-gram probability P(w_i | w_{i-n+1}, ..., w_{i-1}) in the following way: P(w_i | w_{i-n+1}, ..., w_{i-1}) \simeq P(C_i | C_{i-n+1}, ..., C_{i-1}) P(w_i | C_i) \quad (6) where P(C_i | C_{i-n+1}, ..., C_{i-1}) is the class n-gram probability and P(w_i | C_i) is the class membership probability. However, phrases recognized by the same phrase-grammar can actually occur within different syntactic contexts, while their similarity is based on their most likely lexical context. In (Riccardi et al., 1996) we have developed a context dependent training algorithm for the phrase class probabilities.</Paragraph>
<Paragraph position="1"> In particular, P(w_i | w_{i-n+1}, ..., w_{i-1}) = P(C_i | C_{i-n+1}, ..., C_{i-1}; S) P(w_i | C_i; S) \quad (7) where S is the state of the language model assigned by the VNSA model (Riccardi et al., 1996). In particular, S = S(w_{i-n+1}, ..., w_{i-1}; λ_f) is determined by the word history w_{i-n+1}, ..., w_{i-1} and the phrase-grammar model λ_f. For example, our algorithm has acquired the conjunction cluster {but, and, because}, which leads to generating phrases like A T and T or A T because T, the latter clearly an erroneous generalization given our corpus. However, training context dependent probabilities as shown in Equation 7 delivers a stochastic separation between the correct and incorrect phrases: log [ P(A T and T) / P(A T but T) ] = 5.7 \quad (8)</Paragraph>
<Paragraph position="2"> Given a set of phrases containing terminal and non-terminal symbols, the goal of large vocabulary stochastic language modeling for speech recognition and understanding is to assign a probability to all terminal symbol sequences. One of the main motivations for learning phrase-grammars is to decrease the local uncertainty in decoding spontaneous speech by embedding tightly constrained structure in the large vocabulary automaton.</Paragraph>
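<Paragraph> As an illustration of how such state-conditioned class probabilities might be applied when scoring a word (a sketch only; the dictionary-based tables, the state representation and the lookup keys are assumptions of this example, not the VNSA implementation):

import math

def phrase_class_logprob(word, history, state, class_of, class_ngram, membership):
    """Score log P(w_i | history) as log P(C_i | class history; S) plus
    log P(w_i | C_i; S), following Equation 7, where S is the language model
    state assigned by the VNSA. `class_ngram` and `membership` are
    dictionaries standing in for the trained, state-conditioned tables."""
    c = class_of.get(word, word)    # unclustered words act as their own class
    c_history = tuple(class_of.get(w, w) for w in history)
    return (math.log(class_ngram[(state, c_history, c)]) +
            math.log(membership[(state, c, word)]))

# Comparing two competing generalizations, e.g. "A T and T" versus
# "A T but T", then reduces to a difference of such log scores, which is
# the kind of stochastic separation reported in Equation 8.
</Paragraph>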
<Paragraph position="3"> The language models trained on the acquired phrase-grammars give a slight improvement in perplexity (an average measure of uncertainty). Another figure of merit in evaluating a stochastic language model is its local entropy, -\sum_i P(s_i | s) \log P(s_i | s), which is related to the notion of the branching factor of a language model state s. In Figure 8 we plot the local entropy histograms for word, phrase and phrase-grammar bigram stochastic models. The word bigram distribution reflects the sparseness of the word pair constraints. The phrase-grammar based language model delivers a local entropy distribution skewed towards the range [0, 1] because of the tight constraints enforced by the phrase-grammars.</Paragraph>
</Section> </Section>
<Section position="6" start_page="194" end_page="195" type="metho"> <SectionTitle> 5 Spoken Language Application </SectionTitle>
<Paragraph position="0"> We have applied the algorithms for phrase-grammar acquisition to the How May I Help You? (Gorin et al., 1997) speech understanding task. We briefly review the problem and the spoken language system.</Paragraph>
<Paragraph position="1"> The goal is to understand callers' responses to the open-ended prompt How May I Help You? and route each call based on the meaning of the response. Thus we aim at extracting a relatively small number of semantic actions from the utterances of a very large set of users who are not trained in the system's capabilities and limitations.</Paragraph>
<Paragraph position="2"> The first utterance of each transaction has been transcribed and marked with a call-type by labelers. There are 14 call-types plus a class other for the complement class. In particular, we focused our study on the classification of the caller's first utterance in these dialogs. The spoken sentences vary widely in duration, with a distribution distinctively skewed around a mean value of 5.3 seconds, corresponding to 19 words per utterance. Some examples of first utterances are given below: * Yes ma'am where is area code two zero one? * I'm tryn'a call and I can't get it to go through I wondered if you could try it for me please? * Hello</Paragraph>
<Paragraph position="3"> In the training set there are 3.6K words which define the lexicon. The out-of-vocabulary (OOV) rate at the token level is 1.6%, yielding a sentence-level OOV rate of 30%. Significantly, only 50 out of the 100 lowest-rank singletons were cities and names, while the others were regular words like authorized, realized, etc.</Paragraph>
<Paragraph position="4"> For call-type classification from speech we designed a large vocabulary one-step speech recognizer utilizing the stochastic phrase-grammar model (Section 4), which achieved 60% word accuracy. Then, we categorized the decoded speech input into call-types using the salient fragment classifier developed in (Gorin, 1996; Gorin et al., 1997). The salient phrases have the property of modeling local constraints of the language while carrying most of the semantic interpretation of the whole utterance. A block diagram of the speech understanding system is given in Figure 10.</Paragraph>
<Paragraph position="5"> In an automated call router there are two important performance measures. The first is the probability of false rejection, where a call is falsely rejected, i.e., classified as other. Since such calls would be transferred to a human agent, this corresponds to a missed opportunity for automation. The second measure is the probability of correct classification. Errors in this dimension lead to misinterpretations that must be resolved by a dialog manager (Abella and Gorin, 1997).</Paragraph>
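<Paragraph> For concreteness, the two operating measures can be computed from classifier output as in the following sketch (the data layout, the thresholding scheme and the normalization of the correct classification rate are assumptions of this illustration and may differ from the exact definitions used for Figure 9):

def router_performance(results, threshold, rank=1):
    """Return (false rejection rate, correct classification rate) at a given
    salience threshold. `results` is a list of (reference_label, ranked)
    pairs, where `ranked` is a list of (call_type, score) sorted by score."""
    n = len(results)
    rejected = correct = 0
    for reference, ranked in results:
        accepted = [ct for ct, score in ranked[:rank] if score >= threshold]
        if not accepted:
            rejected += 1          # call handed over to a human agent
        elif reference in accepted:
            correct += 1
    accepted_calls = n - rejected
    return (rejected / n,
            correct / accepted_calls if accepted_calls else 0.0)

# Sweeping the threshold traces out a curve of correct classification versus
# false rejection as in Figure 9; rank=2 scores a call as correct when the
# reference call-type is among the top two hypotheses.
</Paragraph>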
<Paragraph position="6"> In Figure 9, we plot the probability of correct classification versus the probability of false rejection, for different speech recognition language models and the same classifier (Gorin et al., 1997). The curves are generated by varying a salience threshold (Gorin, 1996). In a dialog system, it would be useful even if the correct call-type were only one of the top two choices of the decision rule (Abella and Gorin, 1997). Thus, in Figure 9 the classification scores are shown for the first and second ranked call-types identified by the understanding algorithm. The phrase-grammar trigram model is compared to the baseline system, which is based on the phrase-based stochastic finite state machines described in (Gorin et al., 1997). The phrase-grammar model outperforms the baseline phrase-based model, achieving a 22% classification error rate reduction.</Paragraph>
<Paragraph position="7"> The second set of curves (Text) in Figure 9 gives an upper bound on the performance of the speech experiments. It is worth noting that the rank-2 performance of the phrase-grammar model is aligned with the rank-1 classification performance on the true transcriptions (dashed lines).</Paragraph>
</Section> </Paper>