Context Dependent Modeling of Phones in Continuous 
Speech Using Decision Trees 
L.R. Bahl, P.V. de Souza, P.S. Gopalakrishnan, D. Nahamoo, M.A. Picheny 
IBM Research Division 
Thomas J. Watson Research Center 
P.O. Box 704, Yorktown Heights, NY 10598 
ABSTRACT 
In a continuous speech recognition system it is impor- 
tant to model the context dependent variations in the pro- 
nunciations of words. In this paper we present an automatic 
method for modeling phonological variation using decision 
trees. For each phone we construct a decision tree that spec- 
ifies the acoustic realization of the phone as a function of 
the context in which it appears. Several thousand sentences 
from a natural language corpus spoken by several talkers are 
used to construct these decision trees. Experimental results 
on a 5000-word vocabulary natural language speech recog- 
nition task are presented. 
INTRODUCTION 
It is well known that the pronunciation of a word 
or subword unit such as a phone depends heavily on 
the context. This phenomenon has been studied ex- 
tensively by phoneticians who have constructed sets of 
phonological rules that explain this context dependence 
[8, 14]. However, the use of such rules in recognition 
systems has not been very successful. Perhaps a 
fundamental problem with this approach is that it re- 
lies on human perception rather than acoustic reality. 
Furthermore, this method only identifies gross changes, 
and the more subtle changes, which are generally unim- 
portant to humans but may be of significant value in 
speech recognition by computers, are ignored. Possibly, 
rules constructed with the aid of spectrograms would 
be more useful, but this would be very tedious and dif- 
ficult. 
In this paper we describe an automatic method for 
modeling the context dependence of pronunciation. In 
particular, we expand on the use of decision trees for 
modeling allophonic variation, which we previously out- 
lined in [4]. Other researchers have modeled all distinct 
sequences of triphones (three consecutive phones) in 
an effort to capture phonological variations [16, 10]. 
The method proposed in this paper has the advantage 
that it allows us to account for much longer contexts. 
In the experiments reported in this paper, we model 
the pronunciation of a phone as a function of the five 
preceding and five following phones. This method also 
has better powers of generalization, i.e., modeling con- 
texts that do not occur in the training data. 
Use of decision trees for identifying allophones has 
been considered in [4, 7, 11, 15]. However, apart from 
[4], these methods have either not been used in a rec- 
ognizer or have not provided significant improvements 
over existing modeling methods. 
In the next section we describe the algorithms used 
for constructing the decision trees. In Section 3 we 
present recognition results for a 5000-word natural lan- 
guage continuous speech recognition task. We also 
present results showing the effect of varying tree 
size and context on the recognition accuracy. Conclud- 
ing remarks are presented in Section 4. 
CONSTRUCTING THE DECISION TREE 
The data used for constructing the decision trees is 
obtained from a database of 20,000 continuous speech 
natural language sentences spoken by 10 different speak- 
ers. For more details about this database, see [4]. Spec- 
tral feature vectors are extracted from the speech at a 
rate of 100 frames per second. These frames are la- 
beled by a vector quantizer using a common alpha- 
bet for all the speakers. This data is used to train a 
set of phonetic Markov models for the words. Using 
the trained phonetic Markov model statistics and the 
Viterbi algorithm, the labeled speech is then aligned 
against the phonetic baseforms. This process results 
in an alignment of a sequence of phones (the phone 
sequence obtained by concatenating the phonetic base- 
forms of the words in the entire training script) with the 
label sequence produced by the vector quantizer. For 
each aligned phone we construct a data record which 
contains the identity of the current phone, denoted as 
P0, the context, i.e. the identities of the K previous 
phones and K following phones in the phone sequence, 
denoted as P_{-K}, ..., P_{-1}, P_1, ..., P_K, and the label se- 
quence aligned against the current phone, denoted as 
y. We partition this collection of data on the basis of 
P0. Thus we have collected, for each phone in the phone 
alphabet, several thousand instances of label sequences 
in various phonetic contexts. Based on this annotated 
data we construct a decision tree for each phone. 
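The data records described above might be represented as follows; this is an illustrative sketch (the class and field names are assumptions, not the authors' data layout), showing one record per aligned phone and the partitioning by P0.

```python
from dataclasses import dataclass

# Hypothetical container for one aligned phone instance; the field
# names are illustrative, not taken from the original system.
@dataclass
class PhoneSample:
    p0: str        # identity of the current phone P0
    context: dict  # maps offset i in -K..K (i != 0) to the phone P_i
    labels: list   # label sequence y aligned against this phone

def partition_by_phone(samples):
    """Group the data records by the identity of the current phone P0."""
    by_phone = {}
    for s in samples:
        by_phone.setdefault(s.p0, []).append(s)
    return by_phone

# Toy data: three aligned phone instances with invented label ids.
samples = [
    PhoneSample("t", {-1: "s", 1: "aa"}, [12, 12, 7]),
    PhoneSample("t", {-1: "ih", 1: "er"}, [12, 9]),
    PhoneSample("s", {-1: "t", 1: "t"}, [3, 3, 3]),
]
groups = partition_by_phone(samples)
```

Each group then supplies the training samples for one phone's decision tree.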
If we had an unlimited supply of annotated data, 
we could solve the context dependence problem exhaus- 
tively by constructing a different model for each phone 
in each possible context. Of course, we do not have 
enough data to do this, but even if we could carry out 
the exhaustive solution, it would take a large amount of 
storage to store all the different models. Thus, because 
of limited data, and a need for parsimony, we com- 
bine the contexts into equivalence classes, and make a 
model for each class. Obviously, each equivalence class 
should consist of contexts that result in similar label 
strings. One effective way of constructing such equiv- 
alence classes is by the use of binary decision trees. 
Readers interested in this topic are urged to read Clas- 
sification and Regression Trees by Breiman, Friedman, 
Olshen and Stone [6]. 
To construct a binary decision tree we begin with 
a collection of data, which in our case consists of all 
the annotated samples for a particular phone. We split 
this into two subsets, and then split each of these two 
subsets into two smaller subsets, and so on. The split- 
ting is done on the basis of binary questions about the 
context P_i, for i = ±1, ..., ±K. In order to construct 
the tree, we need a goodness-of-split evalua- 
tion function. We base the goodness-of-split evaluation 
function on a probabilistic measure that is related to 
the homogeneity of a set of label strings. Finally, we 
need some stopping criteria. We terminate splitting 
when the number of samples at a node falls below a 
threshold, or if the goodness of the best split falls be- 
low a threshold. The result is a binary tree in which 
each terminal node represents one equivalence class of 
contexts. Using the label strings associated with a ter- 
minal node we can construct a fenonic Markov model 
for that node by the method described in [1, 2]. During 
recognition, given a phone and its context, we use the 
decision tree of that phone to determine which model 
should be used. By answering the questions about the 
context at the nodes of the tree, we trace a path to a 
terminal node of the tree, which specifies the model to 
be used. 
Let Q denote a set of binary questions about the 
context. Let n denote a node in the tree, and m(q, n) 
the goodness of the split induced by question q ∈ Q at 
node n. We will need to distinguish between tested and 
untested nodes. A tested node is one on which we have 
evaluated m(q, n) for all questions q ∈ Q and either 
split the node or designated it as a terminal node. It is 
well known that the construction of an optimal binary 
decision tree is an NP-hard problem. We use a sub- 
optimal greedy algorithm to construct the tree, select- 
ing the best question from the set Q at each node. In 
outline, the decision tree construction algorithm works 
as follows. We start with all samples at the root node. 
In each iteration we select some untested node n and 
evaluate m(q, n) for all possible questions q ∈ Q at this 
node. If a stopping criterion is met, we declare node 
n as terminal; otherwise we associate the question q 
with the highest value of m(q, n) with this node. We 
make two new successor nodes. All samples that an- 
swer positively to the question q are transferred to the 
left successor and all other samples are transferred to 
the right successor. We repeat these steps until all nodes 
have been tested. 
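The greedy loop above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it grows the tree recursively, and it substitutes a toy variance-reduction score for the label-string measure m(q, n) developed later in the paper.

```python
# Greedy decision-tree growing as outlined above: at each node, score
# every question, stop if the criteria are met, otherwise split on the
# best question. (Names and data layout are illustrative assumptions.)

def grow_tree(samples, questions, goodness, min_samples, min_gain):
    """Return a nested-dict tree: {"leaf": samples} or
    {"question": q, "yes": subtree, "no": subtree}."""
    if len(samples) < min_samples:          # stopping criterion 1
        return {"leaf": samples}
    best_gain, best_q = max(
        ((goodness(q, samples), q) for q in questions),
        key=lambda t: t[0])
    if best_gain < min_gain:                # stopping criterion 2
        return {"leaf": samples}
    yes = [s for s in samples if best_q(s)]
    no = [s for s in samples if not best_q(s)]
    return {"question": best_q,
            "yes": grow_tree(yes, questions, goodness, min_samples, min_gain),
            "no": grow_tree(no, questions, goodness, min_samples, min_gain)}

# Toy goodness-of-split: reduction in variance (a stand-in for m(q, n)).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_gain(q, xs):
    yes = [x for x in xs if q(x)]
    no = [x for x in xs if not q(x)]
    if not yes or not no:                   # a split that moves nothing
        return float("-inf")
    child = (len(yes) * variance(yes) + len(no) * variance(no)) / len(xs)
    return variance(xs) - child

tree = grow_tree([1, 1, 2, 8, 9, 9],
                 [lambda s: s < 5, lambda s: s < 2],
                 variance_gain, min_samples=2, min_gain=0.01)
```

With these toy samples the root splits on "s < 5", and splitting stops once no question improves the score by at least min_gain.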
The most important aspects of this algorithm are 
the set of questions Q, the goodness-of-split evaluation 
function m(q, n), and the stopping criteria. We discuss 
each of these below. 
The Question Set 
Let P denote the alphabet of phones, and N_P the 
size of this alphabet. In our case N_P = 55. The 
question set Q consists of questions of the form "Is 
P_i ∈ S?" where S ⊂ P. We start with singleton sub- 
sets of P, e.g. S = {p}, S = {t}, etc. In addition, 
we use subsets corresponding to phonologically mean- 
ingful classes of phones commonly used in the analysis 
of speech [9], e.g., S = {p, t, k} (all unvoiced stops), 
S = {p, t, k, b, d, g} (all stops), etc. Each question is 
applied to each element P_i, for i = ±1, ..., ±K, of the 
context. If there are N_S subsets in all, the number of 
questions N_Q is given by N_Q = 2K N_S. Thus there will 
be N_Q splits to be evaluated at each node of the tree. 
In our experiments K = 5 and N_S = 130, leading to a 
total of 1300 questions. 
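The enumeration of the question set can be sketched as below. The particular subsets shown are a small assumed sample (the paper uses N_S = 130 subsets over a 55-phone alphabet), but the counting N_Q = 2K·N_S is as in the text.

```python
# Sketch of the fixed question set: one question "Is P_i in S?" for each
# context position i in -K..-1, 1..K and each phone subset S. The phone
# subsets below are illustrative examples, not the full 130-subset list.

K = 5
subsets = [frozenset(s) for s in (
    [{p} for p in ("p", "t", "k", "b", "d", "g")]   # singleton subsets
    + [{"p", "t", "k"},                              # all unvoiced stops
       {"p", "t", "k", "b", "d", "g"}]               # all stops
)]

# P0 itself is excluded: questions are asked only about the context.
positions = [i for i in range(-K, K + 1) if i != 0]
questions = [(i, s) for i in positions for s in subsets]

def ask(question, context):
    """Answer 'Is P_i in S?' for a context dict mapping offset -> phone."""
    i, s = question
    return context.get(i) in s
```

With the full subset inventory of the paper this yields 2 × 5 × 130 = 1300 questions per node.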
Note that, in general, there are 2^{N_P} different sub- 
sets of P, and, in principle, we could consider all 
2K·2^{N_P} questions. Since this would be too expensive, we have 
chosen what we consider to be a meaningful subset of 
all possible questions and consider only this fixed set 
of questions during tree construction. It is possible to 
generalize the tree construction procedure to use vari- 
able questions which are constructed algorithmically as 
part of the tree construction process, as in [5, 13]. 
Furthermore, the questions we use are called 
simple questions, since each question is applied to one 
element of the context at a time. It is possible to con- 
struct complex questions which deal with several con- 
text elements at once, as in [5]. Again, we did not use 
this more complicated technique in the experiments re- 
ported in this paper. 
The Goodness-of-Split Evaluation Function 
We derive the goodness-of-split evaluation function 
based on a probabilistic model of collections of label 
strings. Let ℳ denote a particular class of paramet- 
ric models that assign probabilities to label strings. 
For any model M ∈ ℳ let Pr_M(y) denote the prob- 
ability assigned to label string y. Let Y_n be the set 
of label strings associated with node n. Pr_M(Y_n) = 
∏_{y ∈ Y_n} Pr_M(y) is a measure of how well the model M 
fits the data at node n. Let M_n ∈ ℳ be the best model 
for Y_n, i.e. Pr_{M_n}(Y_n) ≥ Pr_M(Y_n) for all M. Pr_{M_n}(Y_n) 
is a measure of the purity of Y_n. If the label strings 
in Y_n are similar to each other, then Pr_{M_n}(Y_n) will be 
large. A question q will split the data at node n into 
two subsets based on the outcome of question q. Our 
goal is to pick q so as to make the successor nodes as 
pure as possible. Let Y_l and Y_r denote the subsets of 
label strings at the left and right successor nodes, re- 
spectively. Obviously, Y_l ∪ Y_r = Y_n. Let M_l and M_r 
be the corresponding best models for the two subsets. 
Then 

    m(q, n) = log ( Pr_{M_l}(Y_l) Pr_{M_r}(Y_r) / Pr_{M_n}(Y_n) )    (1) 
is a measure of the improvement in purity as a result 
of the split. Since our goal is to divide the strings into 
subsets containing similar strings, this quantity serves 
us well as the goodness-of-split evaluation function. 
Since we will eventually use the strings at a ter- 
minal node to construct a Markov model, choosing ℳ 
to be a class of Markov models would be the natural 
choice. Unfortunately, this choice of model is compu- 
tationally very expensive: to find the best model M_n, 
we would have to train the model with the forward- 
backward algorithm using all the data at the node n. 
Thus for computational reasons, we have chosen a sim- 
pler class of models: Poisson models of the type used 
in [3] for the polling fast match. 
Recall that y is a sequence of acoustic labels a_1, a_2, 
..., a_l. We make the simplifying assumption that the 
labels in the sequence are independent of each other. 
The extent to which this approximation is inaccurate 
depends on the length of the units being modeled. For 
strings corresponding to single phones, the inaccuracy 
introduced by this approximation is relatively small. 
However, it results in an evaluation function that is 
easy to compute and leads to the construction of very 
good decision trees in practice. 

A result of this assumption is that the order in 
which the labels occur is of no consequence. Now, a 
string y can be fully characterized by its histogram, 
i.e. the number of times each label in the acoustic la- 
bel alphabet occurs in that string. We represent the 
string y by its histogram y_1 y_2 ... y_F, a vector of length 
F where F is the size of the acoustic label alphabet and 
each y_i is the number of times label i occurs in string y. 
We model each component y_i of the histogram by an 
independent Poisson model with mean rate μ_i. Then, 
the probability assigned to y by M is 

    Pr_M(y) = ∏_{i=1}^{F} (μ_i^{y_i} e^{-μ_i}) / y_i!    (2) 

The joint probability of all the strings in the set Y_n is 
then 

    Pr_M(Y_n) = ∏_{y ∈ Y_n} ∏_{i=1}^{F} (μ_i^{y_i} e^{-μ_i}) / y_i!    (3) 

It can easily be shown that Pr_M(Y_n) is maximized by 
choosing the mean rate to be the sample average, i.e., 
the best model for Y_n has as its mean rate 

    μ_{ni} = (1/N_n) ∑_{y ∈ Y_n} y_i    for i = 1, 2, ..., F    (4) 

where N_n is the number of strings in Y_n. 
Let μ_{li} and μ_{ri}, for i = 1, 2, ..., F, denote the optimal 
mean rates for Y_l and Y_r respectively. Using these ex- 
pressions in (1) and eliminating common terms, we can 
show that the evaluation function is given by 

    m(q, n) = ∑_{i=1}^{F} { N_l μ_{li} log μ_{li} + N_r μ_{ri} log μ_{ri} 
                            − N_n μ_{ni} log μ_{ni} }    (5) 

where N_l is the total number of strings at the left node 
and N_r is the total number of strings at the right node 
resulting from split q. At each node, we select the 
question q that maximizes the evaluation function (5). 
The evaluation function given in equation (5) is 
very general, and arises from several different model as- 
sumptions. For example, if we assume that the length 
of each string is given by a Poisson distribution, and 
the labels in a string are produced independently by a 
multinomial distribution, then the evaluation function 
of equation (5) results. There are also some interesting 
relationships between this function and a minimization- 
of-entropy formulation. Due to space limitations, the 
details are omitted here. 
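Equation (5) might be computed as in the sketch below; this is an illustrative reconstruction under the assumptions of the text (each label string reduced to an F-dimensional count histogram, mean rates estimated as sample averages), not the original code.

```python
import math

# Evaluate equation (5) for a candidate split of a node's label strings
# into a left and a right subset, each given as a list of equal-length
# histograms (count vectors over the F-label acoustic alphabet).

def split_score(left, right):
    """m(q, n) of eq. (5), using the convention 0 * log 0 = 0."""
    def term(strings):
        # N * sum_i mu_i log mu_i, with mu_i the sample mean rate (eq. (4))
        n = len(strings)
        f = len(strings[0])
        mu = [sum(y[i] for y in strings) / n for i in range(f)]
        return n * sum(m * math.log(m) for m in mu if m > 0)
    return term(left) + term(right) - term(left + right)
```

A split that separates two distinct kinds of histograms scores strictly higher than one that mixes them, matching the purity interpretation in the text.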
The stopping criteria 
We use two very simple stopping criteria. If the 
value m(q, n) of the best split at a node n is less than 
a threshold T_m, we designate the node to be a terminal node. 
Also, if the number of samples at a node falls below a 
threshold T_s, we designate it to be a terminal node. 
The thresholds T_m and T_s are selected empirically. 
Using the Decision Trees During Recognition 
The terminal nodes of a tree for a phone correspond 
to the different allophones of the phone. We construct 
a fenonic Markov model for each terminal node from 
the label strings associated with the node. The details 
of this procedure are described in [1, 2]. 
During recognition, we construct Markov models for 
word sequences as follows. We construct a sequence of 
phones by concatenating the phonetic baseforms of the 
words. For each phone in this sequence, we use the 
appropriate decision tree and trace the path in the tree 
corresponding to the context provided by the phone 
sequence. This leads to a terminal node, and we use 
the fenonic Markov model associated with this node. 
By concatenating the fenonic Markov models for each 
phone we obtain a Markov model for the entire word 
sequence. 
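The traversal described above can be sketched as follows; the nested-dict node layout and the model identifiers are illustrative assumptions, not the authors' data structures.

```python
# Trace a phone's decision tree at recognition time: answer the question
# stored at each internal node until a terminal node (allophone) is hit.

def find_allophone(tree, context):
    """Follow yes/no branches until a leaf; return the leaf's model id.

    `context` maps offsets i (in -K..K, i != 0) to the phones P_i taken
    from the concatenated baseform sequence; an unknown right context
    simply fails the membership test, as for the tentative end models.
    """
    node = tree
    while "model" not in node:
        i, subset = node["question"]          # the question "Is P_i in S?"
        branch = "yes" if context.get(i) in subset else "no"
        node = node[branch]
    return node["model"]

# Toy tree for one phone: a single split on the identity of P_1.
tree = {
    "question": (1, frozenset({"p", "t", "k"})),
    "yes": {"model": "allophone_A"},
    "no": {"model": "allophone_B"},
}
```

For example, a following "t" selects one fenonic model and any other (or unknown) right neighbor selects the other.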
For the last few phones in the phone sequence, the 
right context is not fully known. For these phones, 
we make tentative models ignoring the unknown right 
context. When the sequence of words is extended, the 
right context for these phones will be available, and we 
can replace the tentative models by the correct models 
and recompute the acoustic match probabilities. This 
procedure is quite simple and the details are omitted 
here. 
EXPERIMENTAL RESULTS 
We tested this method on a 5000-word, continuous 
speech, natural language task. The test vocabulary 
consists of the 5000 most frequent words taken from 
a large quantity of IBM electronic mail. The training 
data consisted of 2000 sentences read by each of 10 dif- 
ferent talkers. The first 500 sentences were the same 
for each talker, while the other 1500 were different from 
talker to talker. The training sentences were covered 
by a 20,000 word vocabulary and the allophonic models 
were constructed for this vocabulary. The test set con- 
sisted of 50 sentences (591 words) covered by our 5000 
word vocabulary, Interested readers can refer to \[4\] \[or 
more details of the task and the recognition system. 
We constructed the decision trees using the train- 
ing data described above. The phone alphabet was of 
size 55, K was chosen to be 5, and on the average the 
number of terminal nodes per decision tree, and con- 
sequently the number of allophones per phone, was 45. 
We tested the system with 10 talkers. The error rates 
reported here are for the same 10 talkers whose ut- 
terances were used for constructing the decision trees. 
Each talker provided roughly 2,000 sentences of train- 
ing data for constructing the vector quantizer proto- 
types and for training the Markov model parameters. 
Tests were also done using context independent pho- 
netic models. In both cases, the same vector quantizer pro- 
totypes were used and the models were trained using 
the same data. Table 1 shows the error rates for the 
phonetic (context independent) and allophonic (con- 
text dependent) models for the 10 talkers. On the aver- 
age, the word error rate decreases from 10.3% to 5.9%. 
We tested the performance of our allophonic models 
for talkers who were not part of the training database. 
The error rates for five new test talkers using the allo- 
phonic models are shown in Table 2. As can be seen, 
the error rates obtained using the allophonic models 
are comparable to those given in Table 1. 
We also trained and decoded using triphone-based 
HMMs [16]. In these experiments, only intra-word tri- 
phone models were used; we did not attempt to con- 
struct cross-word triphone models. The number of pho- 
netic models in our system is 55; approximately 10,000 
triphone models were required to cover our 20,000-word 
vocabulary. No attempt was made to cluster these 
into a smaller number as is done for generalized tri- 
phones [10]. Both phonetic and triphone models were 
trained using the forward-backward algorithm in the 
usual manner; the triphone statistics were smoothed 
back onto the underlying phonetic models via deleted 
estimation. The topology of the triphone and pho- 
netic models was the same: seven-state models with independent 
distributions for the beginning, middle, and end of each 
phone, as described in [12]. Results are shown in the 
fourth column of Table 1. These results are signifi- 
cantly worse than the results obtained with our allo- 
phonic models. However, it should be noted that these 
triphone models do not incorporate several techniques 
that are currently in use [11]. 
Varying Context and Tree Size 
The number of preceding and following phones that 
are used in the construction of the decision tree influ- 
ences the recognition performance considerably. We 
constructed several decision trees that examine differ- 
ent amounts of context, and the recognition error rates 
obtained using these models are shown in Table 3. The 
second column shows results for models constructed us- 
ing decision trees that examine only one phone preced- 
ing and following the current one. The third column 
shows results for trees that examine two phones pre- 
ceding and following the current phone and so on to 
the last column for trees that examine five preceding 
and following phones. The stopping criterion used in 
all cases was the same, as was the training and test set. 
These results show that increasing the amount of con- 
text information improves the recognition performance 
of the system using these models. 
An important issue in constructing decision trees 
is when to stop splitting the nodes. As we generate 
more and more nodes, the tree gets better and better 
for the training data but may not be appropriate for 
new data. In order to find an appropriate tree size, we 
conducted several decoding experiments using models 
constructed from decision trees of various sizes built 
from the same training data. We constructed decision 
trees of different sizes using the following scheme. We 
first constructed a set of decision trees using the algo- 
rithms given in Section 2, but without using the stop- 
ping criterion based on the goodness-of-split evaluation 
function. The splitting is terminated only when we are 
left with one sample at a node or when all samples 
at a node have identical context so that no question 
can split the node. The context used consisted of the 
5 preceding and following phones. Now, sets of trees 
of varying sizes can be obtained from these large trees 
by pruning. We store the value of the goodness-of-split 
evaluation function m(q, n) obtained at each node. The 
tree for each phone is pruned back as follows. We ex- 
amine all nodes n both of whose successor nodes are 
terminal nodes. From among these we select the node 
n* which has the smallest value for the evaluation func- 
tion m(q, n*). If this value is less than a threshold T_m, 
we discard this split and mark the node n* as a leaf. 
This process is repeated until no more pruning can be 
done. By varying the pruning threshold T_m, we can 
obtain decision trees with different numbers of nodes. 
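The bottom-up pruning just described can be sketched as below; the nested-dict tree layout, with the stored m(q, n) value kept in a "score" field at each internal node, is an illustrative assumption.

```python
# Prune a fully grown tree: repeatedly find the internal node both of
# whose successors are leaves and whose stored split score m(q, n) is
# smallest; if that score is below the threshold T_m, collapse it.

def prune(tree, threshold):
    def is_leaf(node):
        return "leaf" in node

    def candidates(node):
        """Yield internal nodes both of whose successors are leaves."""
        if is_leaf(node):
            return
        if is_leaf(node["yes"]) and is_leaf(node["no"]):
            yield node
        else:
            yield from candidates(node["yes"])
            yield from candidates(node["no"])

    while True:
        cands = list(candidates(tree))
        if not cands:
            break
        worst = min(cands, key=lambda n: n["score"])
        if worst["score"] >= threshold:
            break
        # Discard the split: merge the two children back into one leaf.
        worst["leaf"] = worst["yes"]["leaf"] + worst["no"]["leaf"]
        for key in ("question", "score", "yes", "no"):
            del worst[key]
    return tree

# Toy tree: the lower split has a small score and should be pruned away.
tree = {"question": "q0", "score": 5.0,
        "yes": {"question": "q1", "score": 0.1,
                "yes": {"leaf": [1]}, "no": {"leaf": [2]}},
        "no": {"leaf": [3, 4]}}
prune(tree, threshold=1.0)
```

Raising the threshold collapses more splits, which is how the trees of different sizes in Table 4 would be obtained.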
Table 4 shows the decoding error rates using models 
obtained for trees of various sizes. The second column 
shows the results obtained with trees having an aver- 
age of 23 terminal nodes (allophones) per phone. The 
third, fourth, and fifth columns show the error rates for 
33, 45, and 85 allophones per phone respectively. The 
training and test sets were the same as those described 
earlier in this section. As can be seen, increasing the 
number of allophones beyond 45 did not result in in- 
creased accuracy. 
CONCLUSIONS 
Acoustic models used in continuous speech recog- 
nition systems should account for variations in pro- 
nunciation arising from contextual effects. This paper 
demonstrates that such effects can be discovered au- 
tomatically, and represented very effectively using bi- 
nary decision trees. We have presented a method for 
constructing and using decision trees for modeling allo- 
phonic variation. Experiments with continuous speech 
recognition show that this method is effective in reduc- 
ing the word error rate. 
REFERENCES 
[1] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer, 
M.A. Picheny, "Automatic Construction of Acoustic 
Markov Models for Words," Proc. International Sym- 
posium on Signal Processing and Its Applications, Bris- 
bane, Australia, 1987, pp. 565-569. 
[2] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer, and 
M.A. Picheny, "Acoustic Markov Models Used in the 
Tangora Speech Recognition System," Proc. ICASSP- 
88, New York, NY, April 1988, pp. 497-500. 
[3] L.R. Bahl, R. Bakis, P.V. de Souza, and R.L. Mercer, 
"Obtaining Candidate Words by Polling in a Large Vo- 
cabulary Speech Recognition System," Proc. ICASSP- 
88, New York, NY, April 1988, pp. 489-492. 
[4] L.R. Bahl et al., "Large Vocabulary Natural Language 
Continuous Speech Recognition," Proc. ICASSP-89, 
Glasgow, Scotland, May 1989, pp. 465-467. 
[5] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer, 
"A Tree-Based Language Model for Natural Language 
Speech Recognition," IEEE Transactions on ASSP, Vol. 
37, No. 7, July 1989, pp. 1001-1008. 
[6] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, 
Classification and Regression Trees, Wadsworth Statis- 
tics/Probability Series, Belmont, CA, 1984. 
Speaker    Phonetic   Allophonic   Triphone 
T1           10.5        8.3          8.6 
T2           17.8        9.5         12.0 
T3           13.4        6.6          9.0 
T4           12.5        7.6          8.8 
T5            2.9        1.9          2.0 
T6           14.4        7.8         11.7 
T7            8.3        3.6          8.5 
T8            3.2        3.0          2.9 
T9            5.9        3.6          5.9 
T10          13.7        7.6         10.2 
Average      10.3%       5.9%         8.0% 

Table 1: Recognition Error Rate 
Speaker    Error Rate with Allophonic Models 
T11          3.2 
T12          3.5 
T13          8.1 
T14          6.9 
T15          5.5 
Average      5.4% 

Table 2: Error Rate on New Set of Test Talkers 
[7] F.R. Chen, J. Shrager, "Automatic Discovery of Con- 
textual Factors Describing Phonological Variation," 
Proc. 1989 DARPA Workshop on Speech and Natural 
Language. 
[8] P.S. Cohen and R.L. Mercer, "The Phonological Com- 
ponent of an Automatic Speech Recognition System," 
in Speech Recognition, D.R. Reddy, editor, Academic 
Press, New York, 1975, pp. 275-320. 
[9] G. Fant, Speech Sounds and Features, MIT Press, Cam- 
bridge, MA, 1973. 
[10] K.F. Lee, H.W. Hon, M.Y. Hwang, S. Mahajan, R. 
Reddy, "The Sphinx Speech Recognition System," Proc. 
ICASSP-89, Glasgow, Scotland, May 1989, pp. 445-448. 
[11] K.F. Lee et al., "Allophone Clustering for Continuous 
Speech Recognition," Proc. ICASSP-90, Albuquerque, 
NM, April 1990, pp. 749-752. 
[12] B. Merialdo, "Multilevel decoding for very large size 
dictionary speech recognition," IBM Journal of Re- 
search and Development, Vol. 32, March 1988, pp. 227- 
237. 
[13] A. Nadas, D. Nahamoo, M.A. Picheny and J. Powell, 
"An Iterative Flip-Flop Approximation of the Most In- 
formative Split in the Construction of Decision Trees," 
Proc. ICASSP-91, to appear. 
[14] B.T. Oshika, V.W. Zue, R.V. Weeks, H. Neu and 
J. Auerbach, "The Role of Phonological Rules in 
Speech Understanding Research," IEEE Transactions 
on ASSP, Vol. ASSP-23, 1975, pp. 104-112. 
[15] M.A. Randolph, "A Data-Driven Method for Dis- 
covering and Predicting Allophonic Variation," Proc. 
ICASSP-90, Albuquerque, NM, April 1990, pp. 1177- 
1180. 
[16] R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Kras- 
ner, J. Makhoul, "Context-Dependent Modeling for 
Acoustic-Phonetic Recognition of Continuous Speech," 
Proc. ICASSP-85, April 1985. 
Speaker     K=5 
T1          8.3 
T2          9.5 
T3          6.6 
T4          7.6 
T5          1.9 
T6          7.8 
T7          3.6 
T8          3.0 
T9          3.6 
T10         7.6 
Average     5.9% 

(The per-speaker entries for K=1, K=2, and K=3 are not legible in the 
source; only the K=5 column, which matches the allophonic column of 
Table 1, is reproduced here.) 

Table 3: Error Rates with Varying Context Length 
           Average Number of Allophones 
Speaker     23     33     45     85 
T1          9.6    9.1    8.3    8.0 
T2         11.2   10.8    9.5    9.6 
T3          8.1    6.8    6.6    5.8 
T4          7.8    7.6    7.6    6.4 
T5          1.9    1.9    1.9    3.2 
T6          9.3    9.0    7.8    7.8 
T7          5.4    4.9    3.6    4.1 
T8          3.4    4.2    3.0    3.2 
T9          4.9    3.9    3.6    4.4 
T10         6.8    6.3    7.6    6.9 
Average     6.8%   6.4%   5.9%   5.9% 

Table 4: Error Rates for Different Tree Sizes 
