New Techniques for Context Modeling 
Eric Sven Ristad and Robert G. Thomas 
Department of Computer Science 
Princeton University 
{ristad, rgt}@cs.princeton.edu
Abstract 
We introduce three new techniques for sta- 
tistical language models: extension mod- 
eling, nonmonotonic contexts, and the di- 
vergence heuristic. Together these tech- 
niques result in language models that have 
few states, even fewer parameters, and low 
message entropies. 
1 Introduction 
Current approaches to automatic speech and hand- 
writing transcription demand a strong language 
model with a small number of states and an even 
smaller number of parameters. If the model entropy 
is high, then transcription results are abysmal. If 
there are too many states, then transcription be- 
comes computationally infeasible. And if there are 
too many parameters, then "overfitting" occurs and
predictive performance degrades. 
In this paper we introduce three new techniques 
for statistical language models: extension modeling, 
nonmonotonic contexts, and the divergence heuris- 
tic. Together these techniques result in language 
models that have few states, even fewer parameters, 
and low message entropies. For example, our tech- 
niques achieve a message entropy of 1.97 bits/char 
on the Brown corpus using only 89,325 parameters. 
By modestly increasing the number of model param- 
eters in a principled manner, our techniques are able 
to further reduce the message entropy of the Brown 
Corpus to 1.91 bits/char.¹ In contrast, the character 4-gram model requires 250 times as many parameters in order to achieve a message entropy of only 2.47 bits/char. Given the logarithmic nature of codelengths, a savings of 0.5 bits/char is quite significant. The fact that our model performs significantly better using vastly fewer parameters argues that it is a much better probability model of natural language text.

¹ The only change to our model selection procedure is to replace the incremental cost formula $\Delta L_\phi(w, E', \sigma)$ with a constant cost of 2 bits/extension. This small change reduces the test message entropy from 1.97 to 1.91 bits/char, but it also quadruples the number of model parameters and triples the total codelength.
Our first two techniques, nonmonotonic contexts and extension modeling, are generalizations of the traditional context model (Cleary and Witten, 1984; Rissanen, 1983, 1986). Our third technique, the divergence heuristic, is an incremental model selection criterion based directly on Rissanen's (1978) minimum description length (MDL) principle. The
MDL principle states that the best model is the sim- 
plest model that provides a compact description of 
the observed data. 
In the traditional context model, every prefix and 
every suffix of a context is also a context. Three 
consequences follow from this property. The first 
consequence is that the context dictionary is un- 
necessarily large because most of these contexts are 
redundant. The second consequence is to attenu- 
ate the benefits of context blending, because most 
contexts are equivalent to their maximal proper suf- 
fixes. The third consequence is that the length of the 
longest candidate context can increase by at most 
one symbol at each time step, which impairs the 
model's ability to model complex sources. In a non- 
monotonic model, this constraint is relaxed to allow 
compact dictionaries, discontinuous backoff, and ar- 
bitrary context switching. 
The traditional context model maps every history 
to a unique context. All symbols are predicted us- 
ing that context, and those predictions are estimated 
using the same set of histories. In contrast, an exten- 
sion model maps every history to a set of contexts,
one for each symbol in the alphabet. Each symbol is 
predicted in its own context, and the model's current 
predictions need not be estimated using the same 
set of histories. This is a form of parameter tying 
that increases the accuracy of the model's predic- 
tions while reducing the number of free parameters 
in the model. 
As a result of these two generalizations, nonmono- 
tonic extension models can outperform their equiv- 
alent context models using significantly fewer pa- 
rameters. For example, an order 3 n-gram (i.e., the 4-gram) requires more than 51 times as many contexts and 787 times as many parameters as the order 3 nonmonotonic extension model, yet already performs worse on the Brown corpus by 0.08 bits/char.
Our third contribution is the divergence heuris- 
tic, which adds a more specific context to the model 
only when it reduces the codelength of the past data 
more than it increases the codelength of the model. 
In contrast, the traditional selection heuristic adds a 
more specific context to the model only if its entropy
is less than the entropy of the more general context 
(Rissanen 1983,1986). The traditional minimum en- 
tropy heuristic is a special case of the more effective 
and more powerful divergence heuristic. The diver- 
gence heuristic allows our models to generalize from 
the training corpus to the testing corpus, even for 
nonstationary sources such as the Brown corpus. 
The remainder of our article is organized into 
three sections. In section 2, we formally define the 
class of extension models and present a heuristic 
model selection algorithm for that model class based 
on the divergence criterion. Next, in section 3, we 
demonstrate the efficacy of our techniques on the 
Brown Corpus, an eclectic collection of English prose 
containing approximately one million words of text. 
Section 4 discusses possible improvements to the 
model class. 
2 Extension Model Class 
This section consists of four parts. In 2.1, we for- 
mally define the class of extension models and prove 
that they satisfy the axioms of probability. In 2.2, 
we show how to estimate the parameters of an exten-
sion model using Moffat's (1990) "method C." In 2.3, 
we provide codelength formulas for our model class, 
based on efficient enumerative codes. These code- 
length formulas will be used to match the complexity 
of the model to the complexity of the data. In 2.4, 
we present a heuristic model selection algorithm that 
adds parameters to an extension model only when 
they reduce the codelength of the data more than 
they increase the codelength of the model. 
2.1 Model Class Definition 
Formally, an extension model $\phi = (\Sigma, D, E, \lambda)$ consists of a finite alphabet $\Sigma$, $|\Sigma| = m$, a dictionary $D$ of contexts, $D \subseteq \Sigma^*$, a set of available context extensions $E$, $E \subseteq D \times \Sigma$, and a probability function $\lambda : E \to [0, 1]$. For every context $w$ in $D$, $E(w)$ is the set of symbols available in the context $w$ and $\lambda(\sigma|w)$ is the conditional probability of the symbol $\sigma$ in the context $w$. Note that $\sum_\sigma \lambda(\sigma|w) \le 1$ for all contexts $w$ in the dictionary $D$.

The probability $\hat{p}(h|\phi)$ of a string $h$ given the model $\phi$, $h \in \Sigma^*$, is calculated as a chain of conditional probabilities (1),

  $\hat{p}(h|\phi) = \hat{p}(h_n|h_1 \ldots h_{n-1}, \phi)\, \hat{p}(h_1 \ldots h_{n-1}|\phi)$   (1)

while the conditional probability $\hat{p}(\sigma|h, \phi)$ of a single symbol $\sigma$ after the history $h$ is defined as (2).

  $\hat{p}(\sigma|h, \phi) = \begin{cases} \lambda(\sigma|h) & \text{if } (h, \sigma) \in E \\ \delta(h)\, \hat{p}(\sigma|h_2 h_3 \ldots h_n, \phi) & \text{otherwise} \end{cases}$   (2)

The expansion factor $\delta(h)$ ensures that $\hat{p}(\cdot|h, \phi)$ is a probability function if $\hat{p}(\cdot|h_2 \ldots h_n, \phi)$ is a probability function.

  $\delta(h) = \dfrac{1 - \lambda(E(h)|h)}{1 - \hat{p}(E(h)|h_2 \ldots h_n, \phi)}$   (3)

Note that $E(h)$ represents a set of symbols, and so by a slight abuse of notation $\lambda(E(h)|h)$ denotes $\sum_{\sigma \in E(h)} \lambda(\sigma|h)$, i.e., the sum of $\lambda(\sigma|h)$ over all $\sigma$ in $E(h)$.

Example 1. Let $\Sigma = \{0, 1\}$, $D = \{\epsilon, \text{"0"}\}$, $E(\epsilon) = \{0, 1\}$, and $E(\text{"0"}) = \{0\}$. Suppose $\lambda(0|\epsilon) = \frac{1}{2}$, $\lambda(1|\epsilon) = \frac{1}{2}$, and $\lambda(0|\text{"0"}) = \frac{3}{4}$. Then $\delta(\text{"0"}) = \frac{1 - 3/4}{1 - 1/2} = \frac{1}{2}$ and $\hat{p}(1|\text{"0"}, \phi) = \delta(\text{"0"})\,\lambda(1|\epsilon) = \frac{1}{4}$.
The fundamental difference between a context model and an extension model lies in the inputs to the context selection rule, not its outputs. The traditional context model includes a selection rule $s : \Sigma^* \to D$ whose only input is the history. In contrast, an extension model includes a selection rule $s : \Sigma^* \times \Sigma \to D$ whose inputs include the past history and the symbol to be predicted. This distinction is preserved even if we generalize the selection rule to select a set of candidate contexts. Under such a generalization, the context model would map every history to a set of candidate contexts, i.e., $s : \Sigma^* \to 2^D$, while an extension model would map every history and symbol to a set of candidate contexts, i.e., $s : \Sigma^* \times \Sigma \to 2^D$.
Our extension selection rule $s : \Sigma^* \times \Sigma \to D$ is defined implicitly by the set $E$ of extensions currently in the model. The recursion in (2) says that each symbol should be predicted in its longest candidate context, while the expansion factor $\delta(h)$ says that longer contexts in the model should be trusted more than shorter contexts when combining the predictions from different contexts.
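The two signatures can be seen in miniature in the sketch below. It is illustrative only: $E$ is represented as a set of (context, symbol) pairs, $D$ as a set of strings containing the empty context, and both rules return the longest candidate context, as the recursion in (2) requires.

def select_context_model(h: str, D: set) -> str:
    # Context model: one context per history, shared by all symbols.
    return max((w for w in D if h.endswith(w)), key=len)

def select_extension_model(h: str, E: set) -> dict:
    # Extension model: each symbol is predicted in its own longest
    # candidate context, so different symbols may use different
    # contexts for the same history.
    symbols = {s for (_, s) in E}
    return {s: max((w for (w, sym) in E if sym == s and h.endswith(w)),
                   key=len, default='')
            for s in symbols}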
An extension model $\phi$ is valid iff it satisfies the following constraints (4):

  a. $\epsilon \in D \wedge E(\epsilon) = \Sigma$
  b. $\forall w \in D\ [\lambda(E(w)|w) \le 1]$
  c. $\forall w \in D\ [E(w) = \Sigma \Rightarrow \sum_{\sigma \in E(w)} \lambda(\sigma|w) = 1]$   (4)

These constraints suffice to ensure that the model $\phi$ defines a probability function. Constraint (4a) states that every symbol has the empty string as a context. This guarantees that every symbol will always have at least one context in every history and that the recursion in (2) will terminate. Constraint (4b) states that the sum of the probabilities of the extensions $E(w)$ available in a given context $w$ cannot sum to more than unity. The third constraint (4c) states that the sum of the probabilities of the extensions $E(w)$ must sum exactly to unity when every symbol is available in that context (i.e., when $E(w) = \Sigma$).
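These constraints are easy to verify mechanically. The sketch below is an illustrative checker (the names are ours), with the model again given as a map lam from (context, symbol) pairs to probabilities.

def is_valid(lam: dict, alphabet: set, tol: float = 1e-9) -> bool:
    # Check the validity constraints (4a)-(4c).
    D = {w for (w, _) in lam}
    E = {w: {s for (ww, s) in lam if ww == w} for w in D}
    if '' not in D or E[''] != alphabet:
        return False                                     # (4a)
    for w in D:
        total = sum(lam[(w, s)] for s in E[w])
        if total > 1.0 + tol:
            return False                                 # (4b)
        if E[w] == alphabet and abs(total - 1.0) > tol:
            return False                                 # (4c)
    return True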
Lemma 2.1  $\forall y \in \Sigma^*\ \forall \sigma \in \Sigma\ [\,\hat{p}(\Sigma|y) = 1 \Rightarrow \hat{p}(\Sigma|\sigma y) = 1\,]$

Proof. By the definition of $\delta(\sigma y)$.

Theorem 1  If an extension model $\phi$ is valid, then $\forall n\ [\sum_{h \in \Sigma^n} \hat{p}(h|\phi) = 1]$.

Proof. By induction on $n$. For the base case, $n = 1$ and the statement is true by the definition of validity (constraints 4a and 4c). The induction step is true by Lemma 2.1 and definition (1). □
2.2 Parameter Estimation 
Let us now estimate the conditional probabilities $\lambda(\cdot|\cdot)$ required for an extension model. Traditionally, these conditional probabilities are estimated using string frequencies obtained from a training corpus. Let $c(\sigma|w)$ be the number of times that the symbol $\sigma$ followed the string $w$ in the training corpus, and let $c(w)$ be the sum $\sum_{\sigma \in \Sigma} c(\sigma|w)$ of all its conditional frequencies.
Following Moffat (1990), we first partition the conditional event space $\Sigma$ in a given context $w$ into two subevents: the symbols $q(w)$ that have previously occurred in the context $w$ and those $\bar{q}(w)$ that have not. Formally, $q(w) = \{\sigma : c(\sigma|w) > 0\}$ and $\bar{q}(w) = \Sigma - q(w)$. We estimate $\lambda_C(q(w)|w)$ as $c(w)/(c(w) + \mu(w))$ and $\lambda_C(\bar{q}(w)|w)$ as $\mu(w)/(c(w) + \mu(w))$, where $\mu(w)$ is the total weight assigned to the novel events $\bar{q}(w)$ in the context $w$. Currently, we calculate $\mu(w)$ as $\min(|q(w)|, |\bar{q}(w)|)$ so that highly variable contexts receive more flattening, but no novel symbol in $\bar{q}(w)$ receives more than unity weight. Next, $\lambda_C(\sigma|q(w), w)$ is estimated as $c(\sigma|w)/c(w)$ for the previously seen symbols $\sigma \in q(w)$, and $\lambda_C(\sigma|\bar{q}(w), w)$ is estimated uniformly as $1/|\bar{q}(w)|$ for the novel symbols $\sigma \in \bar{q}(w)$. Combining these estimates, we obtain our overall estimate (5).

  $\lambda_C(\sigma|w) = \begin{cases} \dfrac{c(\sigma|w)}{c(w) + \mu(w)} & \text{if } \sigma \in q(w) \\[1ex] \dfrac{\mu(w)}{|\bar{q}(w)|\,(c(w) + \mu(w))} & \text{otherwise} \end{cases}$   (5)

Unlike Moffat, our estimate (5) does not use escape probabilities or any other form of context blending. All novel events $\bar{q}(w)$ in the context $w$ are assigned uniform probability. This is suboptimal but simpler.
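The following sketch renders estimate (5) in Python. It is ours, not Moffat's code: counts holds the frequencies $c(\cdot|w)$ for one context $w$ with $c(w) > 0$, and the returned map is $\lambda_C(\cdot|w)$.

from collections import Counter

def lambda_C(counts: Counter, alphabet: set) -> dict:
    # Estimate (5): seen symbols receive discounted relative
    # frequencies; the novel mass mu(w) is spread uniformly over the
    # unseen symbols. Assumes c(w) > 0.
    c_w = sum(counts.values())                     # c(w)
    seen = {s for s in alphabet if counts[s] > 0}  # q(w)
    novel = alphabet - seen                        # q-bar(w)
    mu = min(len(seen), len(novel))                # mu(w)
    est = {}
    for s in seen:
        est[s] = counts[s] / (c_w + mu)
    for s in novel:
        est[s] = mu / (len(novel) * (c_w + mu))
    return est

# e.g. c(e|w) = 7, c(a|w) = 1 over {a, e, i, o}: mu = 2, so
# e -> 7/10, a -> 1/10, and i, o -> 1/10 each; the estimates sum to 1.
est = lambda_C(Counter({'e': 7, 'a': 1}), set('aeio'))
assert abs(sum(est.values()) - 1.0) < 1e-9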
We note that our frequencies are incorrect when used in an extension model that contains contexts that are proper suffixes of each other. In such a situation, the shorter context is only used when the longer context was not used. Let $y$ and $xy$ be two distinct contexts in a model $\phi$. Then the context $y$ will never be used when the history is in $\Sigma^* xy$. Therefore, our estimate of $\lambda(\cdot|y)$ should be conditioned on the fact that the longer context $xy$ did not occur. The interaction between candidate contexts can become quite complex, and we consider this problem in other work (Ristad and Thomas, 1995).
Parameter estimation is only a small part of the 
overall model estimation problem. Not only do we 
have to estimate the parameters for a model, we have 
to find the right parameters to use! To do this, we 
proceed in two steps. First, in section 2.3, we use 
the minimum description length (MDL) principle to 
quantify the total merit of a model with respect to 
a training corpus. Next, in section 2.4, we use our 
MDL codelengths to derive a practical model selec- 
tion algorithm with which to find a good model in 
the vast class of all extension models. 
2.3 Codelength Formulas 
The goal of this section is to establish the proper ten- 
sion between model complexity and data complexity, 
in the fundamental units of information. Although 
the MDL framework obliges us to propose particu- 
lar encodings for the model and the data, our goal 
is not to actually encode the data or the model. 
Given an extension model $\phi$ and a text corpus $T$, $|T| = t$, we define the total codelength $L(T, \phi|\Phi)$ relative to the model class $\Phi$ using a 2-part code.

  $L(T, \phi|\Phi) = L(\phi|\Phi) + L(T|\phi, \Phi)$

Since conditioning on the model class $\Phi$ is always understood, we will henceforth suppress it in our notation.

Firstly, we will encode the text $T$ using the probability model $\phi$ and an arithmetic code, obtaining the following codelength.

  $L(T|\phi) = -\log \hat{p}(T|\phi)$

Next, we encode the model $\phi$ in three parts: the context dictionary as $L(D)$, the extensions as $L(E|D)$, and the conditional frequencies $c(\cdot|\cdot)$ as $L(c|D, E)$.

The dictionary $D$ of contexts forms a suffix tree containing $n_i$ vertices with branching factor $i$. The tree contains $n = \sum_{i=1}^{m} n_i$ internal vertices and $n_0$ leaf vertices. There are $(n_0 + n_1 + \cdots + n_m - 1)!\,/\,n_0!\, n_1! \cdots n_m!$ such trees (Knuth, 1968:587). Accordingly, this tree may be encoded with an enumerative code using $L(D)$ bits.
  $L(D) = L_Z(n) + \log\binom{n + m - 1}{m - 1} + \log\dfrac{(n_0 + n_1 + \cdots + n_m - 1)!}{n_0!\, n_1! \cdots n_m!} + \sum_{i=1}^{m-1} n_i \log\binom{m}{i} + L_{Z<}(|\lfloor D \rfloor|, n) + \log\binom{n}{|\lfloor D \rfloor|}$
where $\lfloor D \rfloor$ is the set of all contexts in $D$ that are proper suffixes of another context in $D$. The first term encodes the number $n$ of internal vertices using the Elias code. The second term encodes the counts $\{n_1, n_2, \ldots, n_m\}$. Given the frequencies of these internal vertices, we may calculate the number $n_0$ of leaf vertices as $n_0 = 1 + n_2 + 2n_3 + 3n_4 + \cdots + (m-1)n_m$. The third term encodes the actual tree (without labels) using an enumerative code. The fourth term assigns labels (i.e., symbols from $\Sigma$) to the edges in the tree. At this point the decoder knows all contexts which are not proper suffixes of other contexts, i.e., $D - \lfloor D \rfloor$. The fifth term encodes the magnitude of $\lfloor D \rfloor$ as an integer bounded by the number $n$ of internal vertices in the suffix tree. The sixth term identifies the contexts $\lfloor D \rfloor$ as interior vertices in the tree that are proper suffixes of another context in $D$.
Now we encode the symbols available in each context. Let $m_i$ be the number of contexts that have exactly $i$ extensions, i.e., $m_i = |\{w : |E(w)| = i\}|$. Observe that $\sum_{i=1}^{m} m_i = |D|$.

  $L(E|D) = \log\binom{|D| + m - 1}{m - 1} + \log\dfrac{|D|!}{m_1!\, m_2! \cdots m_m!} + \sum_{i=1}^{m} m_i \log\binom{m}{i}$

The first term represents the encoding of $\{m_i\}$, while the second term represents the encoding of $|E(w)|$ for each $w$ in $D$. The third term represents the encoding of $E(w)$ as a subset of $\Sigma$ for each $w$ in $D$.
Finally, we encode the frequencies $c(\sigma|w)$ used to estimate the model parameters.

  $L(c|D, E) = L_Z(|T|) + \sum_{w \in D} \log\binom{c(w) + |\lceil w \rceil| - 1}{|\lceil w \rceil| - 1} + \sum_{w \in D} \log\binom{c(w) + |E(w)|}{|E(w)|}$

where $\lceil y \rceil$ consists of all contexts that have $y$ as their maximal proper suffix, i.e., all contexts that $y$ immediately dominates, and $\lfloor y \rfloor$ is the maximal proper suffix of $y$ in $D$, i.e., the unique context that immediately dominates $y$. The first term encodes $|T|$ with an Elias code and the second term recursively partitions $c(w)$ into $c(\lceil w \rceil)$ for every context $w$. The third term partitions the context frequency $c(w)$ into the available extensions $c(E(w)|w)$ and the "unallocated frequency" $c(\Sigma - E(w)|w) = c(w) - c(E(w)|w)$ in the context $w$.
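All of the codelength formulas above reduce to log-binomial terms plus a logarithmic integer code. The helper sketch below (the names are ours, and $L_Z$ is taken to be the Elias gamma codelength) shows how such terms are priced in bits.

import math

def log2_binomial(n: int, k: int) -> float:
    # log2 C(n, k): the bits needed to identify one k-subset among
    # C(n, k) equally likely alternatives, computed via log-gamma.
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def elias_gamma_len(n: int) -> int:
    # L_Z(n): length of the Elias gamma code for an integer n >= 1.
    return 2 * int(math.log2(n)) + 1

# Announcing which 3 of 70 symbols extend a context costs about
# 15.7 bits; encoding the integer 89,325 costs 33 bits.
print(round(log2_binomial(70, 3), 1), elias_gamma_len(89325))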
2.4 Model Selection 
The final component of our contribution is a model selection algorithm for the extension model class $\Phi$.
Our algorithm repeatedly refines the accuracy of our 
model in increasingly long contexts. Adding a new 
parameter to the model will decrease the codelength 
of the data and increase the codelength of the model. 
Accordingly, we add a new parameter to the model 
only if doing so will decrease the total codelength of 
the data and the model. 
The incremental cost and benefit of adding a sin- 
gle parameter to a given context cannot be accu- 
rately approximated in isolation from any other pa- 
rameters that might be added to that context. Ac- 
cordingly, the incremental cost of adding the set $E'$ of extensions to the context $w$ is defined as (6), while the incremental benefit is defined as (7).

  $\Delta L_\phi(w, E') = L(\phi \cup (\{w\} \times E')) - L(\phi)$   (6)

  $\Delta L_T(w, E') = L(T|\phi) - L(T|\phi \cup (\{w\} \times E'))$   (7)

Keeping only significant terms that are monotonically nondecreasing, we approximate the incremental cost $\Delta L_\phi(w, E')$ as

  $\log |D| + \log\binom{m}{|E'|} + \log c(\lfloor w \rfloor) + \log\binom{c(w) + |E'|}{|E'|}$.

The first term represents the incremental increase in the size of the context dictionary $D$. The second term represents the cost of encoding the candidate extensions $E(w) = E'$. The third term represents (an upper bound on) the cost of encoding $c(w)$. The fourth term represents the cost of encoding $c(\cdot|w)$ for $E(w)$. Only the second and fourth terms are significant.
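In code, the four-term approximation reads as follows (a sketch with argument names of our choosing, where c_suffix stands for $c(\lfloor w \rfloor)$ and k for $|E'|$).

import math

def log2_binomial(n: int, k: int) -> float:
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def delta_L_model(D_size: int, m: int, c_suffix: int,
                  c_w: int, k: int) -> float:
    # Approximate increase in model codelength for adding k extensions.
    return (math.log2(D_size)              # larger context dictionary
            + log2_binomial(m, k)          # which k of m symbols extend w
            + math.log2(c_suffix)          # upper bound on encoding c(w)
            + log2_binomial(c_w + k, k))   # the frequencies c(.|w)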
Let us now consider the incremental benefit of adding the extensions $E'$ to a given context $w$. The addition of a single parameter $(w, \sigma)$ to the model $\phi$ will immediately change $\lambda(\sigma|w)$, by definition of the model class. Any change to $\lambda(\cdot|w)$ will also change the expansion factor $\delta(w)$ in that context, which may in turn change the conditional probabilities $\hat{p}(\Sigma - E(w)|w, \phi)$ of symbols not available in that context. Thus the incremental benefit of adding the extensions $E'$ to the context $w$ may be calculated as

  $\Delta L_T(w, E') = c(\Sigma - E'|w) \log \dfrac{1 - \lambda(E'|w)}{1 - \hat{p}(E'|w, \phi)} + \sum_{\sigma' \in E'} c(\sigma'|w) \log \dfrac{\lambda(\sigma'|w)}{\hat{p}(\sigma'|w, \phi)}$

The first term represents the incremental benefit (in bits) for evaluating $\Sigma - E'$ in the context $w$ using the more accurate expansion factor $\delta(w)$. The second term represents the incremental benefit (in bits) of using the direct estimate $\lambda(\sigma'|w)$ instead of the model probability $\hat{p}(\sigma'|w, \phi)$ in the context $w$. Note that $\lambda(\sigma'|w)$ may be more or less than $\hat{p}(\sigma'|w, \phi)$.
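Transcribed directly, the benefit computation looks as follows (again a sketch; counts, lam, and p_hat are assumed maps giving $c(\cdot|w)$, $\lambda(\cdot|w)$, and $\hat{p}(\cdot|w, \phi)$ for the context $w$ under consideration).

import math

def delta_L_data(Eprime: set, alphabet: set,
                 counts: dict, lam: dict, p_hat: dict) -> float:
    # Bits saved on the training text by adding the extensions E' to w.
    c_rest = sum(counts.get(s, 0) for s in alphabet - Eprime)
    lam_E = sum(lam[s] for s in Eprime)    # lambda(E'|w)
    p_E = sum(p_hat[s] for s in Eprime)    # p-hat(E'|w, phi)
    bits = c_rest * math.log2((1.0 - lam_E) / (1.0 - p_E))
    bits += sum(counts.get(s, 0) * math.log2(lam[s] / p_hat[s])
                for s in Eprime)
    return bits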
Now the incremental cost and benefit of adding a single extension $(w, \sigma)$ to a model that already contains the extensions $(w, E')$ may be defined as follows.

  $\Delta L_\phi(w, E', \sigma) = \Delta L_\phi(w, E' \cup \{\sigma\}) - \Delta L_\phi(w, E')$

  $\Delta L_T(w, E', \sigma) = \Delta L_T(w, E' \cup \{\sigma\}) - \Delta L_T(w, E')$
Let us now use these incremental cost/benefit for- 
mulas to design a simple heuristic estimation algo- 
rithm for the extension model. The algorithm con- 
sists of two subroutines. Refine(D,E,n) augments 
the model with all individually profitable extensions 
of contexts of length n. It rests on the assump- 
tion that adding a new context does not change 
the model's performance in the shorter contexts. 
Extend(w) determines all profitable extensions of the 
candidate context w, if any exist. Since it is not feasible to evaluate the incremental profit of every subset of $\Sigma$, Extend(w) uses a greedy heuristic that repeatedly augments the set of profitable extensions of $w$ by the single most profitable extension until it is no longer profitable to do so.
Refine(D, E, n)
1. D_n := {}; E_n := {};
2. C_n := {w : w ∈ Σ·C_{n-1} ∧ c(w) > c_min};
3. if ((n > n_max) ∨ (|C_n| = 0)) then return;
4. for w ∈ C_n
5.   E' := Extend(w);
6.   if |E'| > 0 then D_n := D_n ∪ {w}; E_n(w) := E';
7. D := D ∪ D_n; E := E ∪ E_n;
8. Refine(D, E, n+1);

C_n is the set of candidate contexts of length n, obtained from the training corpus. D_n is the set of profitable contexts of length n, while E_n is the set of profitable extensions of those contexts.
Extend(w)
1. S := {};
2. σ := argmax_{σ ∈ Σ} {ΔL(w, {σ})};
3. while (ΔL(w, S, σ) > 0)
4.   S := S ∪ {σ};
5.   σ := argmax_{σ ∈ Σ−S} {ΔL(w, S, σ)};
6. return(S);

The loop in lines 3-5 repeatedly finds the single most profitable symbol σ with which to augment the set S of profitable extensions. The incremental profit ΔL(...) is the incremental benefit ΔL_T(...) minus the incremental cost ΔL_φ(...).
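The greedy loop is short enough to state in full. The sketch below is an illustrative rendering, with the set-level benefit and cost of (6) and (7) supplied as functions dLT and dLphi.

def extend(w: str, alphabet: set, dLT, dLphi) -> set:
    # Greedy Extend(w): grow the set S of profitable extensions by the
    # single most profitable symbol until none remains profitable.
    S = set()
    def profit(sigma):
        # Incremental profit dL(w, S, sigma): benefit minus cost.
        return ((dLT(w, S | {sigma}) - dLT(w, S))
                - (dLphi(w, S | {sigma}) - dLphi(w, S)))
    while True:
        candidates = alphabet - S
        if not candidates:
            return S
        sigma = max(candidates, key=profit)
        if profit(sigma) <= 0:
            return S
        S.add(sigma)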
Our breadth-first search considers shorter contexts before longer ones, and consequently the decision to add a profitable context $y$ may significantly decrease the benefit of a more profitable context $xy$, particularly when $c(xy) \approx c(y)$. For example, consider a source with two hidden states. In the first state, the source generates the alphabet $\Sigma = \{0, 1, 2\}$ uniformly. In the second state, the source generates the string "012" with certainty. With appropriate state transition probabilities, the source generates strings where $c(0) \approx c(1) \approx c(2)$, $c(2|1)/c(1) \gg c(2|\epsilon)/c(\epsilon)$, and $c(2|01)/c(01) > c(2|1)/c(1)$. In such a situation, the best context model includes the contexts "0" and "01" along with the empty context $\epsilon$. However, the divergence heuristic will first determine that the context "1" is profitable relative to the empty context, and add it to the model. Now the profitability of the better context "01" is reduced, and the divergence heuristic may therefore not include it in the model. This problem is best solved with a best-first search. Our current implementation uses a breadth-first search to limit the computational complexity of model selection.
Finally, we note that our parameter estimation techniques and model selection criteria are comparable in computational complexity to Rissanen's context models (1983, 1986). For that reason, extension models should be amenable to efficient online implementation.
3 Empirical Results 
By means of the following experiments, we hope 
to demonstrate the utility of our context modeling 
techniques. All results are based on the Brown cor- 
pus, an eclectic collection of English prose drawn 
from 500 sources across 15 genres (Francis and 
Kucera, 1982). The irregular and nonstationary na- 
ture of this corpus poses an exacting test for sta- 
tistical language models. We use the first 90% of 
each file in the corpus to estimate our models, and 
then use the remaining 10% of each file in the corpus 
to evaluate the models. Each file contains approx- 
imately 2000 words. Due to limited computational resources, we set $n_{max} = 10$, $c_{min} = 8$, and restrict our alphabet size to 70 (i.e., all printing ASCII characters, ignoring case distinction).
Our results are summarized in the following ta- 
ble. Message entropy (in bits/symbol) is for the 
testing corpus only, as per traditional model vali- 
dation methodology. The nonmonotonic extension 
model (NEM) outperforms all other models for all 
orders using vastly fewer parameters. Its performance is all the more impressive when we consider that no context blending or escaping is performed, even for novel events.

Model      Parameters              Entropy
NEM        89,325                  1.97
NCM        687,276                 2.19
MCM1       88,945,904              2.43
MCM2       88,945,904              3.12
n-gram     506,352,021,176,052     3.74

Table 1: Results for the nonmonotonic extension model (NEM), the nonmonotonic context model (NCM), Rissanen's (1983, 1986) monotonic context models (MCM1, MCM2), and the n-gram model. All models are order 7. The rightmost column contains test message entropy in bits/symbol. NEM outperforms all other model classes for all orders using significantly fewer parameters. It is possible to reduce the test message entropy of the NEM and NCM to 1.91 and 1.99, respectively, by quadrupling the number of model parameters.
We note that the test message entropy of the n- 
gram model class is minimized by the 5-gram at 2.38 
bits/char. This result for the 5-gram is not honest 
because knowledge of the test set was used to select 
the optimal model order. Jelinek and Mercer (1980) 
have shown how to interpolate n-grams of different or-
der using mixing parameters that are conditioned 
on the history. Such interpolated Markov sources 
are considerably more powerful than traditional n- 
grams but contain even more parameters. 
The best reported results on the Brown Corpus 
are 1.75 bits/char using a large interpolated trigram 
word model whose parameters are estimated using 
over 600,000,000 words of proprietary training data 
(Brown et al., 1992). The use of proprietary training
data means that these results are not independently 
repeatable. In contrast, our results were obtained 
using only 900,000 words of generally available train- 
ing data and may be independently verified by any- 
one with the inclination to do so. The amount of 
training data is known to be a significant factor in 
model performance. Given a sufficiently rich dictio- 
nary of words and a sufficiently large training corpus, 
a model of word sequences is likely to outperform an 
otherwise equivalent model of character sequences. 
For these three reasons - repeatability, training cor- 
pus size, and the advantage of word models over 
character models - the results reported by Brown et al. (1992) are not directly comparable to those reported here.
Section 3.1 compares the statistical efficiency of the various context model classes. Next, section 3.2 anecdotally examines the complex interactions among the parameters of an extension model.
3.1 Model Class Comparison 
Given the tremendous risk of overfitting, the most 
important property of a model class is arguably its 
statistical efficiency. Informally, statistical efficiency 
measures the effectiveness of individual parameters 
in a given model class. A high efficiency indicates 
that our model class provides a good description of 
the data. Conversely, a low efficiency indicates that 
the model class does not adequately describe the ob- 
served data. 
In this section, we compare the statistical effi- 
ciency of three model classes: context models, ex- 
tension models, and fixed-length Markov processes (i.e., n-grams). Our model class comparison is based
on three criteria of statistical efficiency: total code- 
length, bits/parameter on the test message, and 
bits/order on the test message. The context and 
extension models are all of order 9, and were es- 
timated using the true incremental benefit and a 
range of fixed incremental costs (between 5 and 25 
bits/extension for the extension model and between 
25 and 150 bits/context for the context model). 
According to the first criterion of statistical efficiency, the best model is the one that achieves the smallest total codelength $L(T, \phi)$ of the training corpus $T$ and model $\phi$ using the fewest parameters. This criterion measures the statistical efficiency of a model class according to the MDL framework, where we would like each parameter to be as cheap as possible and do as much work as possible. Figure 1
graphs the number of model parameters required to 
achieve a given total codelength for the training cor- 
pus and model. The extension model class is the 
overwhelming winner. 
Figure 1: The relationship between the number of model parameters and the total codelength $L(T, \phi)$ of the training corpus $T$ and the model $\phi$. By this criterion of statistical efficiency, the extension models completely dominate context models and n-grams.
According to the second criterion of statistical efficiency, the best model is the one that achieves the lowest test message entropy using the fewest parameters. This criterion measures the statistical efficiency of a model class according to traditional model validation methodology, tempered by a healthy concern for overfitting. Figure 2 graphs the number of model parameters required to achieve a given test message entropy for each of the three model classes. Again, the extension model class is the clear winner. (This is particularly striking when the number of parameters is plotted on a linear scale.) For example, one of our extension models saves 0.98 bits/char over the trigram while using less than 1/3 as many parameters. Given the logarithmic nature of codelength and the scarcity of training data, this is a significant improvement.
Figure 2: The relationship between the number of model parameters and test message entropy. The most striking fact about this graph is the tremendous efficiency of the extension model.

According to the third criterion of statistical efficiency, the best model is one that achieves the lowest test message entropy for a given model order. This criterion is widely used in the language modeling community, in part because model order is typically, although not necessarily, related to both
the number of model parameters and the amount of 
computation required to estimate the model. Fig- 
ure 3 compares model order to test message entropy 
for each of the three model classes. As the order 
of the models increases from 0 (i.e., unigram) to 10, we naturally expect the test message entropy to approach a lower bound, which is itself bounded below by the true source entropy. By this criterion, the extension model class is better than the context model class, and both are significantly better than the n-gram.
Figure 3: The relationship between model order and test message entropy. The extension model class is the clear winner by this criterion as well.
3.2 Anecdotes 
It is also worthwhile to interpret the parameters of 
the extension model estimated from the Brown Cor- 
pus, to better understand the interaction between 
our model class and our heuristic model selection al- 
gorithm. According to the divergence heuristic, the decision to add an extension $(w, \sigma)$ is made relative to that context's maximal proper suffix $\lfloor w \rfloor$ in $D$ as well as any other extensions in the context $w$. An extension $(w, \sigma)$ will be added only if the direct estimate of its conditional probability is significantly different from its conditional probability in its maximal proper suffix after scaling by the expansion factor in the context $w$, i.e., if $\lambda(\sigma|w)$ is significantly different from $\delta(w)\,\hat{p}(\sigma|\lfloor w \rfloor)$.
This is illustrated by the three contexts and six extensions shown immediately below, where $+E(w)$ includes all symbols in $E(w)$ that are more likely in $w$ than they were in $\lfloor w \rfloor$ and $-E(w)$ includes all symbols in $E(w)$ that are less likely in $w$ than they were in $\lfloor w \rfloor$. The symbol ␣ denotes the space character.

w                +E(w)     -E(w)
"blish"          e, i, m
"o␣establish"    ␣
"e␣establish"    m         e
The substring blish is most often followed by the characters 'e', 'i', and 'm', corresponding to the relatively frequent word forms publish{ed, er, ing} and establish{ed, ing, ment}. Accordingly, the context "blish" has three positive extensions {e, i, m}, of which e has by far the greatest probability. The context "blish" is the maximal proper suffix of two other contexts in the model, "o␣establish" and "e␣establish".
The substring o␣establish occurs most frequently in the infinitive to establish, which is nearly always followed by a space. Accordingly, the context "o␣establish" has a single positive extension ␣. The substring o␣establish is also found before the characters 'm', 'e', and 'i' in sequences such as to establishments, {who, ratio, also} established, and {to, into, also} establishing. Accordingly, the context "o␣establish" does not have any negative extensions.
In contrast, the substring e␣establish is overwhelmingly followed by the character 'm', rarely followed by 'e', and never followed by either 'i' or space. For this reason, the context "e␣establish" has a single positive extension {m}, corresponding to the great frequency of the string the establishment. This context also has a single negative extension {e}, corresponding to the fact that the character 'e' is still possible in the context "e␣establish" but considerably less likely than in that context's maximal proper suffix "blish".
Since 'i' is reasonably likely in the context "blish" but completely unlikely in the context "e␣establish", we may well wonder why the model does not include the negative extension 'i' in addition to 'e' or even instead of 'e'. This puzzle is explained by the expansion factor as follows. Since the substring e␣establish is only followed by 'm' and 'e', the expansion factor $\delta(\text{"e␣establish"})$ is essentially zero after 'm' and 'e' are added to that context, and therefore $\hat{p}(\Sigma - \{m, e\}\,|\,\text{"e␣establish"})$ is also essentially zero. Thus, 'i' and space are both assigned nearly zero probability in the context "e␣establish", simply because 'm' and 'e' get nearly all the probability in that context.
4 Conclusion 
In ongoing work, we are investigating extension mixture models as well as improved model selection algorithms. An extension mixture model is an extension model whose $\lambda(\sigma|w)$ parameters are estimated by linearly interpolating the empirical probability estimates for all extensions that dominate $w$ with respect to $\sigma$, i.e., all extensions whose symbol is $\sigma$ and whose context is a suffix of $w$. Extension mixing allows us to remove the uniform flattening of zero frequency symbols in our parameter estimates (5). Preliminary results are promising. The idea of context mixing is due to Jelinek and Mercer (1980).
Our results highlight the fundamental tension be- 
tween model complexity and data complexity. If the 
model complexity does not match the data complex- 
ity, then both the total codelength of the past obser- 
vations and the predictive error increase. In other 
words, simply increasing the number of parameters 
in the model does not necessarily increase predictive 
power of the model. Therefore, it is necessary to 
have a fine-grained model along with a heuristic
model selection algorithm to guide the expansion of 
the model in a principled manner. 
Acknowledgements. Thanks to Andrew Appel, Carl de Marcken, and Dafna Scheinvald for their critique. The paper has benefited from discussions with the participants of DCC95. Both authors are partially supported by Young Investigator Award IRI-9258517 to the first author from the National Science Foundation. The second author is additionally supported by a tuition award from the Princeton University Research Board. The research was partially supported by NSF SGER IRI-9217208.

References 
BROWN, P., DELLA PIETRA, V., DELLA PIETRA, S., LAI, J., AND MERCER, R. An estimate of an upper bound for the entropy of English. Computational Linguistics 18 (1992), 31-40.
CLEARY, J., AND WITTEN, I. Data compression 
using adaptive coding and partial string matching. 
IEEE Trans. Comm. COM-32, 4 (1984), 396-402. 
FRANCIS, W. N., AND KUCERA, H. Frequency 
analysis of English usage: lexicon and grammar. 
Houghton Mifflin, Boston, 1982. 
JELINEK, F., AND MERCER, R. L. Interpolated es-
timation of Markov source parameters from sparse 
data. In Pattern Recognition in Practice (Amster- 
dam, May 21-23 1980), E. S. Gelsema and L. N. 
Kanal, Eds., North Holland, pp. 381-397. 
KNUTH, D. E. The Art of Computer Programming, 
1 ed., vol. 1. Addison-Wesley, Reading, MA, 1968. 
MOFFAT, A. Implementing the PPM data compres-
sion scheme. IEEE Trans. Communications 38,
11 (1990), 1917-1921. 
RISSANEN, J. Modeling by shortest data descrip- 
tion. Automatica 14 (1978), 465-471. 
RISSANEN, J. A universal data compression system. 
IEEE Trans. Information Theory IT-29, 5 (1983), 
656-664. 
RISSANEN, J. Complexity of strings in the class of 
Markov sources. IEEE Trans. Information The- 
ory IT-32, 4 (1986), 526-532. 
RISTAD, E. S., AND THOMAS, R. G. Context 
models in the MDL framework. In Proceedings of 
5th Data Compression Conference (Los Alamitos, 
CA, March 28-30 1995), J. Storer and M. Cohn, 
Eds., IEEE Computer Society Press, pp. 62-71. 
