A HYBRID APPROACH TO 
ADAPTIVE STATISTICAL LANGUAGE MODELING 
Ronald Rosenfeld 
School of Computer Science 
Carnegie Mellon University 
Pittsburgh, PA 15213
ABSTRACT 
We describe our latest attempt at adaptive language modeling. At
the heart of our approach is a Maximum Entropy (ME) model which
incorporates many knowledge sources in a consistent manner. The
other components are a selective unigram cache, a conditional bigram
cache, and a conventional static trigram. We describe the knowledge
sources used to build such a model with ARPA's official WSJ corpus, 
and report on perplexity and word error rate results obtained with 
it. Then, three different adaptation paradigms are discussed, and an 
additional experiment, based on AP wire data, is used to compare 
them. 
1. OVERVIEW OF ME FRAMEWORK 
Using several different probability estimates to arrive at one 
combined estimate is a general problem that arises in many 
tasks. The Maximum Entropy (ME) principle has recently 
been demonstrated as a powerful tool for combining statistical 
estimates from diverse sources [1, 2, 3]. The ME principle
([4, 5]) proposes the following:
1. Reformulate the different estimates as constraints on the 
expectation of various functions, to be satisfied by the 
target (combined) estimate. 
2. Among all probability distributions that satisfy these con- 
straints, choose the one that has the highest entropy. 
More specifically, for estimating a probability function P(x),
each constraint i is associated with a constraint function f_i(x)
and a desired expectation c_i. The constraint is then written as:

    E_P f_i \stackrel{\mathrm{def}}{=} \sum_x P(x) f_i(x) = c_i .   (1)

Given consistent constraints, a unique ME solution is guaranteed
to exist, and to be of the form:

    P(x) = \prod_i \mu_i^{f_i(x)} ,   (2)

where the \mu_i's are some unknown constants, to be found.
Probability functions of the form (2) are called log-linear,
and the family of functions defined by holding the f_i's fixed
and varying the \mu_i's is called an exponential family.
To search the family defined by (2) for the \mu_i's that will make
P(x) satisfy all the constraints, an iterative algorithm, "Gen-
eralized Iterative Scaling" (GIS), exists, which is guaranteed
to converge to the solution ([6]), as long as the constraints
are mutually consistent. GIS starts with arbitrary \mu_i values.
At each iteration, it computes the expectations E_P f_i over the
training data, compares them to the desired values c_i, and
then adjusts the \mu_i's by an amount proportional to the ratio of
the two.
Generalized Iterative Scaling can be used to find the ME
estimate of a simple (non-conditional) probability distribution
over some event space. An adaptation of GIS to conditional
probabilities was proposed by [7], as follows. Let P(w|h)
be the desired probability estimate, and let \tilde{P}(h, w) be the
empirical distribution of the training data. Let f_i(h, w) be
any constraint function, and let c_i be its desired expectation.
Equation 1 is now modified to:

    \sum_h \tilde{P}(h) \sum_w P(w|h) f_i(h, w) = c_i .   (3)

See also [1, 2].
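To make the procedure concrete, the following is a minimal Python/NumPy sketch of GIS for the simple (non-conditional) case of equations (1)-(2). The function name, the dense feature matrix, the fixed iteration count and the slack feature are our own illustrative choices, not the paper's implementation; the conditional variant of [7] replaces the model expectation with the left-hand side of equation (3).

    import numpy as np

    def gis(F, c, iterations=100):
        # Generalized Iterative Scaling [6] for P(x) proportional to prod_i mu_i^f_i(x).
        # F: (n_features, n_events) 0/1 matrix of constraint functions f_i(x).
        # c: desired expectations c_i (assumed mutually consistent, strictly positive,
        #    and with sum(c) strictly below the slack constant C).
        F = np.asarray(F, dtype=float)
        c = np.asarray(c, dtype=float)
        n_feat, n_events = F.shape
        # GIS requires a constant feature sum per event; add the standard slack feature.
        C = F.sum(axis=0).max()
        F = np.vstack([F, C - F.sum(axis=0)])
        c = np.append(c, C - c.sum())
        log_mu = np.zeros(n_feat + 1)          # start with arbitrary mu_i = 1
        for _ in range(iterations):
            logp = log_mu @ F                  # unnormalized log P(x)
            p = np.exp(logp - logp.max())
            p /= p.sum()                       # normalize to a distribution
            E = F @ p                          # model expectations E_P[f_i]
            log_mu += np.log(c / E) / C        # mu_i *= (c_i / E_P[f_i])^(1/C)
        return p, np.exp(log_mu)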
2. CAPTURING LONG-DISTANCE 
LINGUISTIC PHENOMENA 
The ME framework is very general, freeing the modeler to 
concentrate on searching for significant information sources 
and choosing the phenomena to be modeled. In statistical 
language modeling, we are interested in information about 
the identity of the next word, wi, given the history h, namely 
the part of the document that was already processed by the 
system. We have so far considered the following information 
sources, all contained within the history: 
Conventional N-grams: the immediately preceding few words, say (w_{i-2}, w_{i-1}).

Long-distance N-grams [8]: N-grams preceding w_i by j positions.

Triggers [9]: the appearance in the history of words related to w_i.

Class triggers: trigger relations among word clusters.

Count-based cache: the number of times w_i already occurred in the history.

Distance-based cache: the last time w_i occurred in the history.

Linguistically defined constraints: number agreement, tense agreement, etc.
Any potential source can be considered separately, and the 
amount of information in it estimated. For example, in esti- 
mating the potential of count-based caches, we might measure 
dependencies of the form depicted in figure 1, and calculate 
the amount of information they may provide. See also [3].
[Figure 1: Count-based cache information: probability of 'DEFAULT' as a function of the number of times it already occurred in the document (x-axis: 0, 1, 2, 3, 4, 5+; y-axis: P(DEFAULT)). The horizontal line is the unconditional probability.]

Perhaps the most important feature of the Maximum Entropy
framework is its extreme generality. For any conceivable
linguistic or statistical phenomenon, appropriate constraint
functions can readily be written. We will demonstrate this
process for several of the knowledge sources listed above.

2.1. Formulating N-grams as Constraints

The usual unigram, bigram and trigram Maximum Likelihood
estimates can be replaced by unigram, bigram and trigram
constraints conveying the same information. Specifically, the
constraint function for the unigram w_1 is:

    f_{w_1}(h, w) = \begin{cases} 1 & \text{if } w = w_1 \\ 0 & \text{otherwise} \end{cases}   (4)

and its associated constraint is:

    \sum_h \tilde{P}(h) \sum_w P(w|h) f_{w_1}(h, w) = \tilde{E} f_{w_1}(h, w) .   (5)

Similarly, the constraint function for the bigram {w_1, w_2} is:

    f_{w_1,w_2}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in } w_1 \text{ and } w = w_2 \\ 0 & \text{otherwise} \end{cases}   (6)

and its associated constraint is:

    \sum_h \tilde{P}(h) \sum_w P(w|h) f_{w_1,w_2}(h, w) = \tilde{E} f_{w_1,w_2}(h, w) ,   (7)

and similarly for higher-order N-grams.

2.2. Formulating Long-Distance N-grams as Constraints

The constraint functions for long-distance N-grams are very
similar to those for conventional (distance-1) N-grams. For
example, the constraint function for the distance-2 trigram
{w_1, w_2, w_3} is:

    f_{w_1,w_2,w_3}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in } (w_1, w_2, w^*) \text{ for some } w^*, \text{ and } w = w_3 \\ 0 & \text{otherwise} \end{cases}   (8)

and its associated constraint is:

    \sum_h \tilde{P}(h) \sum_w P(w|h) f_{w_1,w_2,w_3}(h, w) = \tilde{E} f_{w_1,w_2,w_3}(h, w) ,   (9)

and similarly for other long-distance N-grams.

2.3. Formulating Triggers as Constraints

For class triggers, let A, B be two related word clusters. Define
the constraint function f_{A→B} as:

    f_{A \to B}(h, w) = \begin{cases} 1 & \text{if } \exists w_j \in A \text{ such that } w_j \in h, \text{ and } w \in B \\ 0 & \text{otherwise} \end{cases}   (10)

Set c_{A→B} to \tilde{E}[f_{A→B}], the empirical expectation of f_{A→B} (i.e.,
its expectation in the training data). Now the constraint on
P(h, w) is:

    E_P[f_{A \to B}] = \tilde{E}[f_{A \to B}] .   (11)
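All of these constraint functions are simple indicator functions and translate directly into code. Below is a minimal Python sketch (our own illustration, not the paper's implementation; histories are assumed to be word tuples) of the indicators (4), (6), (8) and (10):

    def unigram_f(w1):
        # Constraint function (4): fires whenever the next word is w1.
        return lambda h, w: 1 if w == w1 else 0

    def bigram_f(w1, w2):
        # Constraint function (6): the history ends in w1 and the next word is w2.
        return lambda h, w: 1 if len(h) >= 1 and h[-1] == w1 and w == w2 else 0

    def distance2_trigram_f(w1, w2, w3):
        # Constraint function (8): h ends in (w1, w2, w*) for some w*,
        # and the next word is w3.
        return lambda h, w: (1 if len(h) >= 3 and h[-3] == w1 and h[-2] == w2
                             and w == w3 else 0)

    def class_trigger_f(A, B):
        # Constraint function (10): some word of cluster A occurred in the
        # history, and the next word belongs to cluster B.
        A, B = set(A), set(B)
        return lambda h, w: 1 if w in B and not A.isdisjoint(h) else 0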
3. SELECTIVE UNIGRAM CACHE 
In a document-based unigram cache, all words that occurred 
in the history of the document are stored, and are used to 
dynamically generate a unigram, which is in turn combined 
with other language model components. N-gram caches were 
first reported by \[10\]. 
The motivation behind a unigram cache is that, once a word 
occurs in a document, its probability of re-occurring is typ- 
ically greatly elevated. But the extent of this phenomenon 
depends on the prior frequency of the word, and is most pro- 
nounced for rare words. The occurrence of a common word
like "THE" provides little new information. Put another way,
the occurrence of a rare word is more surprising, and hence 
provides more information, whereas the occurrence of a more 
common word deviates less from the expectations of the static 
model, and therefore requires a smaller modification to it. 
Bayesian analysis may be used to optimally combine the prior 
of a word with the new evidence provided by its occurrence. 
As a rough first approximation, we implemented a selective 
unigram cache, where only rare words are stored in the cache. 
A word is defined as rare relative to a threshold of static 
unigram frequency. The exact value of the threshold was 
determined by optimizing perplexity on unseen data. This 
scheme proved more useful for perplexity reduction than the 
conventional cache. 
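A minimal sketch of such a selective cache follows (class and method names, and the interpolation interface, are our own; the 0.001 threshold matches the value reported in section 5.1):

    from collections import Counter

    class SelectiveUnigramCache:
        # Caches only 'rare' words: those whose static unigram frequency
        # falls below a threshold tuned by minimizing perplexity on unseen data.

        def __init__(self, static_unigram, threshold=0.001):
            self.static_unigram = static_unigram   # dict: word -> static frequency
            self.threshold = threshold
            self.counts = Counter()
            self.total = 0

        def observe(self, word):
            # Add a history word to the cache only if it is rare.
            if self.static_unigram.get(word, 0.0) < self.threshold:
                self.counts[word] += 1
                self.total += 1

        def prob(self, word):
            # Dynamic unigram estimate, to be interpolated with the other components.
            return self.counts[word] / self.total if self.total else 0.0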
4. CONDITIONAL BIGRAM AND 
TRIGRAM CACHES 
In a document-based bigram cache, all consecutive word pairs 
that occurred in the history of the document are stored, and 
are used to dynamically generate a bigram, which is in turn 
combined with other language model components. A trigram 
cache is similar but is based on all consecutive word triples. 
An alternative way of viewing a bigram cache is as a set of 
unigram caches, one for each word in the history. At most 
one such unigram is consulted at any one time, depending 
on the identity of the last word of the history. Viewed this 
way, it is clear that the bigram cache should contribute to the 
combined model only if the last word of the history is a (non- 
selective) unigram "cache hit". In all other cases, the uniform 
distribution of the bigram cache would only serve to flatten, 
hence degrade, the combined estimate. 
We therefore chose to use a conditional bigram cache, which 
has a non-zero weight only during such a "hit". 
A similar argument can be applied to the trigram cache. Such 
a cache should only be consulted if the last two words of 
the history occurred before, i.e. the trigram cache should 
contribute only immediately following a bigram cache hit. We 
experimented with such a trigram cache, constructed similarly 
to the conditional bigram cache. However, we found that 
it contributed little to perplexity reduction. This is to be 
expected: every bigram cache hit is also a unigram cache hit. 
Therefore, the trigram cache can only refine the distinctions 
already provided by the bigram cache. A document's history 
is typically small (225 words on average in the WSJ corpus). 
For such a modest cache, the refinement provided by the 
trigram is small and statistically unreliable. 
Another way of viewing the conditional bigram and trigram
caches is as regular (i.e. non-selective) caches, which are
later interpolated using weights that depend on the count of 
their context. Then, zero context-counts force respective zero 
weights. 
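A minimal sketch of the conditional bigram cache (our own rendering; the hit test and interface names are assumptions):

    from collections import Counter

    class ConditionalBigramCache:
        # Bigram cache consulted only on a unigram-cache 'hit': its weight
        # is non-zero only when the last history word was seen before.

        def __init__(self):
            self.successors = {}                  # word -> Counter of following words

        def observe(self, prev_word, word):
            # Record one consecutive word pair from the document history.
            self.successors.setdefault(prev_word, Counter())[word] += 1

        def is_hit(self, prev_word):
            # True iff prev_word occurred earlier in the history (as a pair context).
            return prev_word in self.successors

        def prob(self, prev_word, word):
            c = self.successors.get(prev_word)
            if not c:
                return 0.0
            return c[word] / sum(c.values())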
5. THE WSJ SYSTEM 
As a testbed for the above ideas, we used ARPA's CSR task. 
The training data was 38 million words of Wall Street Jour- 
nal (WSJ) text from 1987-1989. The vocabulary used was
ARPA's official "20o.nvp" (20,000 most common WSJ words,
non-verbalized punctuation).

To measure the impact of the amount of training data on
language model adaptation, we experimented with systems
based on varying amounts of training data. The largest model
we built was based on the entire 38M words of WSJ training
data, and is described below.
5.1. The Component Models 
The adaptive language model was based on four component 
language models: 
1. A conventional "compact" backoff trigram model.
"Compact" here means that singleton trigrams (word
triplets that occurred only once in the training data) were
excluded from the model. It consisted of 3.2 million tri-
grams and 3.5 million bigrams. This model also served
as the baseline for comparisons, and was dubbed "the
static model".

2. A Maximum Entropy model trained on the same data as
the trigram, and consisting of the following knowledge
sources:
• High cutoff, distance-1 (conventional) N-grams: 
- All trigrams that occurred 9 or more times in 
the training data (428,000 in all). 
- All bigrams that occurred 9 or more times in 
the training data (327,000). 
- All unigrams.
The high cutoffs were necessary in order to reduce 
the heavy computational requirements of the train- 
ing procedure. 
• High cutoff, distance-2 bigrams and trigrams: 
- All distance-2 trigrams that occurred 5 or more 
times in the training data (795,000 in all). 
- All distance-2 bigrams that occurred 5 or more 
times in the training data (651,000). 
The cutoffs used for the conventional N-grams
were higher than those applied to the distance-2
N-grams. This was done because we expected that
the information lost from the former knowledge
source would be re-introduced, at least partially, by
interpolation with the static model.
• Word Trigger Pairs: For every word in the vocabu-
lary, the top 3 triggers were selected based on their
mutual information with that word as computed
from the training data [1, 2]. This resulted in some
43,000 word trigger pairs (see the sketch following
this list).
3. A selective unigram cache, as described earlier, using a 
unigram threshold of 0.001. 
4. A conditional bigram cache, as described earlier. 
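The following is a minimal sketch of selecting the top-3 triggers per word by average mutual information. It follows [1, 2] only in spirit; the binary event definitions, the `stats` estimator and all names are our own illustrative assumptions.

    import math

    def mutual_information(p_ab, p_a, p_b):
        # Average mutual information between two binary events:
        # a = "trigger word occurred in the history", b = "w is the next word".
        # Assumes 0 < p_a, p_b < 1 and a consistent joint probability p_ab.
        joint = {(1, 1): p_ab,
                 (1, 0): p_a - p_ab,
                 (0, 1): p_b - p_ab,
                 (0, 0): 1.0 - p_a - p_b + p_ab}
        mi = 0.0
        for (a, b), pj in joint.items():
            pa = p_a if a else 1.0 - p_a
            pb = p_b if b else 1.0 - p_b
            if pj > 0.0:
                mi += pj * math.log(pj / (pa * pb))
        return mi

    def top_triggers(word, candidates, stats, k=3):
        # stats(t, word) is a hypothetical estimator returning
        # (p_joint, p_trigger, p_word) computed from the training data.
        key = lambda t: mutual_information(*stats(t, word))
        return sorted(candidates, key=key, reverse=True)[:k]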
5.2. Combining the LM Components 
The combined model was achieved by consulting an appropri- 
ate subset of the above four models. At any one time, the four 
component LMs were combined linearly. But the weights 
used were not fixed, nor did they follow a linear pattern over 
time. 
Since the Maximum Entropy model incorporated information 
from trigger pairs, its relative weight should be increased with 
the length of the history. But since it also incorporated new 
information from distance-2 N-grams, it is useful even at the 
very beginning of a document, and its weight should not start 
at zero. 
We therefore started the Maximum Entropy model with a
weight of ~0.3, which was gradually increased over the first
60 words of the document, to ~0.7. The conventional trigram
started with a weight of ~0.7, and was decreased concurrently
to ~0.3. The conditional bigram cache had a non-zero weight
only during a cache hit, which allowed for a relatively high
weight of ~0.09. The selective unigram cache had a weight
proportional to the size of the cache, saturating at ~0.05. The
weights were always normalized to sum to 1.

While the general weighting scheme was chosen based on con-
siderations discussed above, the specific values of the weights
were chosen by minimizing perplexity of unseen data. It be-
came clear later that this did not always correspond with mini-
mizing error rate. Subsequently, further weight modifications
were determined by direct trial-and-error measurements of
word error rate on development data.
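As an illustration, here is a minimal sketch of this time-varying linear interpolation. The linear ramp shape and the proportionality constant for the unigram-cache weight are our own assumptions; only the endpoint values follow the text above.

    def component_weights(history_len, bigram_hit, cache_size):
        # Time-varying interpolation weights for the four components.
        # Illustrative only: linear ramp over the first 60 words; fixed
        # bigram-cache weight on a hit; cache-size-proportional unigram weight.
        ramp = min(history_len, 60) / 60.0
        w = {
            "max_entropy": 0.3 + 0.4 * ramp,         # ~0.3 -> ~0.7
            "static_trigram": 0.7 - 0.4 * ramp,      # ~0.7 -> ~0.3
            "bigram_cache": 0.09 if bigram_hit else 0.0,
            "unigram_cache": min(0.05, 0.001 * cache_size),  # assumed slope
        }
        total = sum(w.values())
        return {k: v / total for k, v in w.items()}  # normalize to sum to 1

    def interpolated_prob(word, probs, weights):
        # Linear interpolation of the component estimates P_k(word | h),
        # where probs maps component name -> callable returning its estimate.
        return sum(weights[k] * probs[k](word) for k in weights)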
5.3. Varying the Training Data

As mentioned before, we also experimented with systems
based on less training data. We built two such systems, one
based on 5 million words, and the other based on 1 million
words. Both systems were identical to the larger system
described above, except that the Maximum Entropy model
did not employ high cutoffs, but was instead based on the
same N-gram information as the conventional trigram model.

5.4. Computational Costs

The computational bottleneck of the Generalized Iterative
Scaling algorithm is in constraints which, for typical histo-
ries h, are non-zero for a large number of words w. This
means that bigram constraints are more expensive than trigram
constraints. Implicit computation can be used for unigram
constraints. Therefore, the time cost of bigram and trigger
constraints dominated the total time cost of the algorithm.

The computational burden of training the Maximum Entropy
model for the large system (38MW) was quite severe. For-
tunately, the training procedure is highly parallelizable (see
[1]). Training was run in parallel on 10-25 high performance
workstations, with an average of perhaps 15 machines. Even
so, it took 3 weeks to complete.

In comparison, training the 5MW system took only a few
machine-days, and training the 1MW system was trivial.

5.5. Perplexity Reduction

We used 325,000 words of unseen WSJ data to measure per-
plexities of the baseline trigram model, the Maximum En-
tropy component, and the interpolated adaptive model (the
latter consisting of the first two together with the unigram and
bigram caches). This was done for each of the three systems
(38MW, 5MW and 1MW). Results are summarized in table 1.
amt. of training data            1M     5M     38M
trigram (baseline) perplexity    269    173    105
Maximum Entropy perplexity       203    123     86
  PP reduction                   24%    29%    18%
interpolated model perplexity    163    108     71
  PP reduction                   39%    38%    32%

Table 1: Perplexity (PP) improvement of Maximum Entropy
and interpolated adaptive models over a conventional trigram
model, for varying amounts of training data. The 38MW ME
model used far fewer parameters than the baseline, since it
employed high N-gram cutoffs. See text.
As can be observed, the Maximum Entropy model, even when 
used alone, was significantly better than the static model. 
Its relative advantage seems greater with more training data. 
With the large (38MW) system, practical considerations re-
quired imposing high cutoffs on the ME model, and yet its
perplexity is still significantly better than that of the baseline. 
This is particularly notable because the ME model uses only 
one third the number of parameters used by the trigram model 
(2.26M vs. 6.72M). 
When the Maximum Entropy model is supplemented with the 
other three components, perplexity is again reduced signifi- 
cantly. Here the relationship with the amount of training data 
is reversed: the less training data, the greater the improve- 
ment. This effect is due to the caches, and can be explained as 
follows: The amount of information provided by the caches 
is independent of the amount of training data, and is therefore 
fixed across the three systems. However, the 1MW system
has higher perplexity, and therefore the relative improvement 
provided by the caches is greater. Put another way, mod- 
els based on more data are stronger, and therefore harder to 
improve on. 
5.6. Error Rate Reduction 
To evaluate error rate reduction, we used the Nov93 ARPA
S1 evaluation set [11, 12, 13]. It consisted of 424 utter-
ances produced in the context of complete long documents
by two male and two female speakers. We used the SPHINX-
II recognizer ([14, 15, 16]) with sex-dependent non-PD 10K
senone acoustic models. In addition to the 20K words in
the lexicon, 178 OOV words and their correct phonetic tran-
scriptions were added in order to create closed vocabulary
conditions. We first ran the forward and backward passes of
SPHINX-II to create word lattices, which were then used by
three independent A* passes. The first such pass used the 
38MW static trigram language model. The other two passes 
used the 38MW interpolated adaptive LM. The first of these 
two adaptive runs was for unsupervised word-by-word adap- 
tation, in which the decoder output was used to update the 
language model. The other run used supervised adaptation, 
in which the decoder output was used for within-sentence 
adaptation, while the correct sentence transcription was used 
for across-sentence adaptation. Results are summarized in 
table 2. 
language model               word error rate    % reduction
static trigram (baseline)    19.9%              --
unsupervised adaptation      17.8%              10%
supervised adaptation        17.0%              14%

Table 2: Word error rate reduction of adaptive language mod-
els over a conventional trigram model.
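The two adaptive runs above differ only in which word sequence is fed back into the adaptive components. A minimal sketch of the update policy (our own rendering; `update_history` is a hypothetical hook on the adaptive LM):

    def adapt(lm, decoded_words, reference_words, supervised):
        # Unsupervised adaptation: the decoder output updates the language
        # model, recognition errors and all. Supervised adaptation: decoder
        # output is still used within the sentence, but the correct
        # transcription updates the across-sentence history.
        lm.update_history(reference_words if supervised else decoded_words)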
6. THREE PARADIGMS OF ADAPTATION

The adaptation we concentrated on so far was the kind we call
within-domain adaptation. In this paradigm, a heterogeneous
language source (such as WSJ) is treated as a complex product
of multiple domains-of-discourse ("sublanguages"). The goal
is then to produce a continuously modified model that tracks
sublanguage mixtures, sublanguage shifts, style shifts, etc.

In contrast, a cross-domain adaptation paradigm is one in
which the test data comes from a source to which the language
model has never been exposed. The most salient aspect of this
case is the large number of out-of-vocabulary words, as well
as the high proportion of new bigrams and trigrams.

Cross-domain adaptation is most important in cases where
no data from the test domain is available for training the
system. But in practice this rarely happens. More likely, a
limited amount of LM training data can be obtained. Thus a
hybrid paradigm, limited-data domain, might be the most
important one for real-world applications.

The main disadvantage of the Maximum Entropy framework
is the computational requirements of training the ME model.
But these are not severe for modest amounts of training data
(up to, say, 5M words, with current CPUs). The approach is
thus particularly attractive in limited-data domains.

7. THE AP WIRE EXPERIMENT

We have already seen the effect of the amount of training
data on perplexity reduction in the WSJ system. To test
our adaptation mechanisms under both the cross-domain and
limited-data paradigms, we constructed another experiment,
this time using AP wire data for testing.

For measuring cross-domain adaptation, we used the 38MW
WSJ models described above. For measuring limited-data
adaptation, we used 5M words of AP wire to train a con-
ventional compact backoff trigram, and a Maximum Entropy
model, similar to the ones used by the WSJ system, except
that the trigger pair list was copied from the WSJ system.

All models were tested on 420,000 words of unseen AP data.
We chose the same "20o.nvp" vocabulary used in the WSJ exper-
iments, to facilitate cross comparisons. As before, we mea-
sured perplexities of the baseline trigram model, the Maximum
Entropy component, and the interpolated adaptive model. Re-
sults are summarized in table 3.

To test error rate reduction under the cross-domain adapta-
tion paradigm, we used 206 sentences, recorded by 3 male
and 3 female speakers, under the same system configuration
described in section 5.6. Results are reported in table 4.
8. SUMMARY 
We described our latest attempt at adaptive language model- 
ing. At the heart of our approach is a Maximum Entropy (ME) 
model, which incorporates many knowledge sources in a con- 
sistent manner. We have demonstrated that the ME model 
significantly improves on the conventional static trigram, a
challenge which has evaded many past attempts ([17, 18]).
The approach is particularly applicable in domains with a 
modest amount of LM training data. 
paradigm                       cross-domain    limited-data
training data                  38MW (WSJ)      5M (AP)
trigram (baseline) perplexity  206             170
Maximum Entropy perplexity     170             135
  PP reduction                 17%             21%
interpolated model perplexity  130             114
  PP reduction                 37%             33%

Table 3: Perplexity improvement of Maximum Entropy and
interpolated adaptive models, for both cross-domain and
limited-data adaptation, testing on 420KW of unseen AP wire
data.
9. ACKNOWLEDGEMENTS 
I am grateful to the entire CMU speech group, and many 
other individuals at CMU, for generously allowing me to 
monopolize their machines for weeks on end. I am particularly 
grateful to Lin Chase and Ravishankar Mosur for much needed 
help in designing and implementing the interface to SPHINX- 
II, to Alex Rudnicky for conditioning tools for the AP wire 
data, and to Raj Reddy for his support and encouragement. 
The ideas for this work were developed during my 1992 sum- 
mer visit with the Speech and Natural Language group at 
IBM Watson Research Center. I am grateful to Peter Brown, 
Stephen Della Pietra, Vincent Della Pietra, Raymond Lau, 
Bob Mercer and Salim Roukos for their very significant par-
ticipation.
This research was sponsored by the Department of the Navy, 
Naval Research Laboratory under Grant No. N00014-93-1- 
2005. The views and conclusions contained in this document 
are those of the authors and should not be interpreted as rep- 
resenting the official policies, either expressed or implied, of 
the U.S. Government. 
training data: 38MW (WSJ); test data: 206 sentences (AP)

language model             word error rate    % change
trigram (baseline)         22.1%              --
supervised adaptation      19.8%              -10%

Table 4: Word error rate reduction of the adaptive language
model over a conventional trigram model, under the cross-
domain adaptation paradigm.
References 
1. Rosenfeld, R., "Adaptive Statistical Language Modeling: a
Maximum Entropy Approach." Ph.D. Thesis, Carnegie Mellon
University, April 1994.
2. Lau, R., Rosenfeld, R., Roukos, S., "Trigger-Based Language
Models: a Maximum Entropy Approach." Proceedings of
ICASSP-93, April 1993.
3. Lau, R., Rosenfeld, R., Roukos, S., "Adaptive Language Mod-
eling Using the Maximum Entropy Principle," in Proc. ARPA
Human Language Technology Workshop, March 1993.
4. Jaynes, E. T., "Information Theory and Statistical Mechanics."
Phys. Rev. 106, pp. 620-630, 1957.
5. Kullback, S., Information Theory in Statistics. Wiley, New
York, 1959.
6. Darroch, J.N. and Ratcliff, D., "Generalized Iterative Scaling
for Log-Linear Models," The Annals of Mathematical Statis-
tics, Vol. 43, pp. 1470-1480, 1972.
7. Brown, P., Della Pietra, S., Della Pietra, V., Mercer, R., Nadas,
A., and Roukos, S., "Maximum Entropy Methods and Their
Applications to Maximum Likelihood Parameter Estimation of
Conditional Exponential Models," a forthcoming IBM techni-
cal report.
8. Huang, X.D., Alleva, F., Hon, H.W., Hwang, M.Y., Lee, K.F.,
and Rosenfeld, R., "The SPHINX-II Speech Recognition Sys-
tem: An Overview." Computer Speech and Language, 1992.
9. Rosenfeld, R., and Huang, X. D., "Improvements in Stochas-
tic Language Modeling." Proc. DARPA Speech and Natural
Language Workshop, February 1992.
10. Kuhn, R., "Speech Recognition and the Frequency of Re-
cently Used Words: A Modified Markov Model for Natural
Language." 12th International Conference on Computational
Linguistics (COLING 88), pages 348-350, Budapest, August
1988.
11. Kubala, F., et al., "The Hub and Spoke Paradigm for CSR Evalu-
ation," in Proc. ARPA Human Language Technology Workshop,
March 1994.
12. Pallett, D.S., Fiscus, J.G., Fisher, W.M., Garofolo, J.S., Lund,
B., and Przybocki, M., "1993 Benchmark Tests for the ARPA
Spoken Language Program," in Proc. ARPA Human Language
Technology Workshop, March 1994.
13. Rosenfeld, R., "Language Model Adaptation in ARPA's CSR
Evaluation," ARPA Spoken Language Systems Workshop,
March 1994.
14. Huang, X.D., Alleva, F., Hon, H.W., Hwang, M.Y., Lee, K.F.,
and Rosenfeld, R., "The SPHINX-II Speech Recognition Sys-
tem: An Overview," Computer Speech and Language, 1993.
15. Huang, X., Alleva, F., Hwang, M.-Y., and Rosenfeld, R., "An
Overview of the SPHINX-II Speech Recognition System," in
Proc. ARPA Human Language Technology Workshop, March
1993.
16. Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L.,
Weide, R., Huang, X., and Alleva, F., "Improving Speech-
Recognition Performance Via Phone-Dependent VQ Code-
books, Multiple Speaker Clusters And Adaptive Language
Models," ARPA Spoken Language Systems Workshop, March
1994.
17. Bahl, L., Brown, P., DeSouza, P., and Mercer, R., "A Tree-
Based Statistical Language Model for Natural Language Speech
Recognition," IEEE Transactions on Acoustics, Speech and Sig-
nal Processing, 37, pp. 1001-1008, 1989.
18. Jelinek, F., "Up From Trigrams!" Eurospeech 1991.