Hierarchical Non-Emitting Markov Models 
Eric Sven Ristad and Robert G. Thomas 
Department of Computer Science 
Princeton University 
Princeton, NJ 08544-2087 
{ristad,rgt}@cs.princeton.edu
Abstract 
We describe a simple variant of the inter- 
polated Markov model with non-emitting 
state transitions and prove that it is strictly 
more powerful than any Markov model. 
Empirical results demonstrate that the 
non-emitting model outperforms the inter- 
polated model on the Brown corpus and 
on the Wall Street Journal under a wide 
range of experimental conditions. The non- 
emitting model is also much less prone to 
overtraining. 
1 Introduction 
The Markov model has long been the core technol- 
ogy of statistical language modeling. Many other 
models have been proposed, but none has offered a 
better combination of predictive performance, com- 
putational efficiency, and ease of implementation. 
Here we add hierarchical non-emitting state tran- 
sitions to the Markov model. Although the states 
in our model remain Markovian, the model itself 
is no longer Markovian because it can represent 
unbounded dependencies in the state distribution. 
Consequently, the non-emitting Markov model is 
strictly more powerful than any Markov model, in- 
cluding the context model (Rissanen, 1983; Rissa- 
nen, 1986), the backoff model (Cleary and Witten, 
1984; Katz, 1987), and the interpolated Markov 
model (Jelinek and Mercer, 1980; MacKay and Peto, 
1994). 
More importantly, the non-emitting model consis- 
tently outperforms the interpolated Markov model 
on natural language texts, under a wide range of 
experimental conditions. We believe that the su- 
perior performance of the non-emitting model is 
due to its ability to better model conditional inde- 
pendence. Thus, the non-emitting model is better 
able to represent both conditional independence and 
long-distance dependence, i.e., it is simply a better
statistical model. The non-emitting model is also
nearly as computationally efficient and easy to
implement as the interpolated model.
The remainder of our article consists of four sec- 
tions. In section 2, we review the interpolated 
Markov model and briefly demonstrate that all inter- 
polated models are equivalent to some basic Markov 
model of the same model order. Next, we introduce 
the hierarchical non-emitting Markov model in sec- 
tion 3, and prove that even a lowly second order 
non-emitting model is strictly more powerful than 
any basic Markov model, of any model order. In 
section 4, we report empirical results for the inter- 
polated model and the non-emitting model on the 
Brown corpus and Wall Street Journal. Finally, in 
section 5 we conjecture that the empirical success of 
the non-emitting model is due to its ability to bet- 
ter model a point of apparent independence, such as 
may occur at a sentence boundary. 
Our notation is as follows. Let $A$ be a finite alphabet
of distinct symbols, $|A| = k$, and let $x^T \in A^T$
denote an arbitrary string of length $T$ over the
alphabet $A$. Then $x_i^j$ denotes the substring of $x^T$ that
begins at position $i$ and ends at position $j$. For
convenience, we abbreviate the unit length substring $x_i^i$
as $x_i$ and the length $t$ prefix of $x^T$ as $x^t$.
2 Background 
Here we review the basic Markov model and the in- 
terpolated Markov model, and establish their equiv- 
alence. 
A basic Markov model $\phi = (A, n, \delta_n)$ consists of
an alphabet $A$, a model order $n$, $n > 0$, and the
state transition probabilities $\delta_n : A^n \times A \to [0, 1]$.
With probability $\delta_n(y \mid x^n)$, a Markov model in the
state $x^n$ will emit the symbol $y$ and transition to the
state $x_2^n y$. Therefore, the probability $p_m(x_t \mid x^{t-1}, \phi)$
assigned by an order $n$ basic Markov model $\phi$ to a
symbol $x_t$ in the history $x^{t-1}$ depends only on the
last $n$ symbols of the history.
$$p_m(x_t \mid x^{t-1}, \phi) = \delta_n(x_t \mid x_{t-n}^{t-1}) \qquad (1)$$
An interpolated Markov model $\phi = (A, n, \lambda, \delta)$
consists of a finite alphabet $A$, a maximal model
order $n$, the state transition probabilities $\delta = \delta_0 \ldots \delta_n$,
$\delta_i : A^i \times A \to [0, 1]$, and the state-conditional
interpolation parameters $\lambda = \lambda_0 \ldots \lambda_n$, $\lambda_i : A^i \to [0, 1]$.
The probability assigned by an interpolated model 
is a linear combination of the probabilities assigned 
by all the lower order Markov models. 
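$$p_c(y \mid x^i, \phi) = \lambda_i(x^i)\, \delta_i(y \mid x^i) + (1 - \lambda_i(x^i))\, p_c(y \mid x_2^i, \phi) \qquad (2)$$
where $\lambda_i(x^i) = 0$ for $i > n$, and therefore
$p_c(x_t \mid x^{t-1}, \phi) = p_c(x_t \mid x_{t-n}^{t-1}, \phi)$, i.e., the prediction
depends only on the last $n$ symbols of the history.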
In the interpolated model, the interpolation pa- 
rameters smooth the conditional probabilities esti- 
mated from longer histories with those estimated 
from shorter histories (Jelinek and Mercer, 1980).
Longer histories support stronger predictions, while 
shorter histories have more accurate statistics. In- 
terpolating the predictions from histories of different 
lengths results in more accurate predictions than can 
be obtained from any fixed history length. 
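
As an illustration of the recursion in (2), the following minimal Python sketch evaluates an interpolated prediction from raw counts. The function names (`train_counts`, `delta`, `p_interp`), the use of relative frequencies for the state transition probabilities, and the convention that the empty context always emits are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter

def train_counts(text, n):
    """Count (context, symbol) pairs for every context length 0..n."""
    ctx_counts, pair_counts = Counter(), Counter()
    for t, y in enumerate(text):
        for i in range(min(n, t) + 1):
            ctx = text[t - i:t]          # the i symbols preceding position t
            ctx_counts[ctx] += 1
            pair_counts[ctx, y] += 1
    return ctx_counts, pair_counts

def delta(ctx_counts, pair_counts, y, ctx):
    """Relative-frequency estimate of delta_i(y | ctx); 0 for an unseen context."""
    total = ctx_counts[ctx]
    return pair_counts[ctx, y] / total if total else 0.0

def p_interp(y, ctx, n, lam, ctx_counts, pair_counts):
    """Recursion (2): mix the order-i estimate with the shorter-context prediction."""
    ctx = ctx[-n:] if n else ""          # only the last n symbols matter
    if ctx == "":
        # Assumed base case: the empty context always emits.
        return delta(ctx_counts, pair_counts, y, "")
    l = lam(ctx)                         # interpolation parameter for this state
    return (l * delta(ctx_counts, pair_counts, y, ctx)
            + (1 - l) * p_interp(y, ctx[1:], n, lam, ctx_counts, pair_counts))

if __name__ == "__main__":
    counts = train_counts("abracadabra", 2)
    print(p_interp("r", "ab", 2, lambda ctx: 0.5, *counts))
```

Because `p_interp` truncates its context to the last `n` symbols before recursing, any two histories that share a length-`n` suffix receive the same prediction, which is the property stated after (2).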
A quick glance at the form of (2) and (1) re- 
veals the fundamental simplicity of the interpolated 
Markov model. Every interpolated model $\phi$ is
equivalent to some basic Markov model $\phi'$ (lemma 2.1),
and every basic Markov model $\phi$ is equivalent to
some interpolated context model $\phi'$ (lemma 2.2).
Lemma 2.1
$$\forall \phi \; \exists \phi' \; \forall x^T \in A^* \; [\, p_m(x^T \mid \phi', T) = p_c(x^T \mid \phi, T) \,]$$
Proof. We may convert the interpolated model $\phi$
into a basic model $\phi'$ of the same model order $n$,
simply by setting $\delta'_n(y \mid x^n)$ equal to $p_c(y \mid x^n, \phi)$ for
all states $x^n \in A^n$ and symbols $y \in A$. $\Box$
Lemma 2.2
$$\forall \phi \; \exists \phi' \; \forall x^T \in A^* \; [\, p_c(x^T \mid \phi', T) = p_m(x^T \mid \phi, T) \,]$$
Proof. Every basic model is equivalent to an
interpolated model whose interpolation values are unity
for states of order $n$. $\Box$
The lemmas suffice to establish the following the- 
orem. 
Theorem 1 The class of interpolated Markov mod- 
els is equivalent to the class of basic Markov models. 
Proof. By lemmas 2.1 and 2.2. $\Box$
A similar argument applies to the backoff model. 
Every backoff model can be converted into an equiv- 
alent basic model, and every basic model is a backoff 
model. 
3 Non-Emitting Markov Models 
A hierarchical non-emitting Markov model $\phi =
(A, n, \lambda, \delta)$ consists of an alphabet $A$, a maximal
model order $n$, the state transition probabilities,
$\delta = \delta_0 \ldots \delta_n$, $\delta_i : A^i \times A \to [0, 1]$, and the
non-emitting state transition probabilities $\lambda = \lambda_0 \ldots \lambda_n$,
$\lambda_i : A^i \to [0, 1]$. With probability $1 - \lambda_i(x^i)$, a
non-emitting model will transition from the state $x^i$ to
the state $x_2^i$ without emitting a symbol. With
probability $\lambda_i(x^i)\, \delta_i(y \mid x^i)$, a non-emitting model will
transition from the state $x^i$ to the state $x^i y$ and emit the
symbol $y$.
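Therefore, the probability $p_e(y^j \mid x^i, \phi)$ assigned to
a string $y^j$ in the history $x^i$ by a non-emitting model
$\phi$ has the recursive form (3),
$$p_e(y^j \mid x^i, \phi) = \lambda_i(x^i)\, \delta_i(y_1 \mid x^i)\, p_e(y_2^j \mid x^i y_1, \phi) + (1 - \lambda_i(x^i))\, p_e(y^j \mid x_2^i, \phi) \qquad (3)$$
where $\lambda_i(x^i) = 0$ for $i > n$ and $\lambda_0(\epsilon) = 1$. Note that,
unlike the basic Markov model, $p_e(x_t \mid x^{t-1}, \phi) \neq
p_e(x_t \mid x_{t-n}^{t-1}, \phi)$ because the state distribution of the
non-emitting model depends on the prefix $x^{t-n}$.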
This simple fact will allow us to establish that there 
exists a non-emitting model that is not equivalent to 
any basic model. 
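
The recursive form of a non-emitting model's string probability can be sketched directly in Python. In this sketch, `make_nonemitting`, `lam`, and `delta` are hypothetical names, and the base case (the empty string has probability 1) is an assumption that the text leaves implicit. Memoization with `functools.lru_cache` keeps the recursion from recomputing shared subproblems.

```python
from functools import lru_cache

def make_nonemitting(n, lam, delta):
    """Build p_e(y, x): probability of string y given history (state) x.

    lam(ctx)        -> probability that state ctx emits (1 - lam is the
                       non-emitting transition to the shorter state ctx[1:]).
    delta(sym, ctx) -> probability of emitting sym from state ctx.
    """
    @lru_cache(maxsize=None)
    def p_e(y, x):
        if y == "":
            return 1.0                   # assumed base case: nothing left to emit
        if len(x) > n:
            l = 0.0                      # states longer than n never emit
        elif x == "":
            l = 1.0                      # the empty state always emits
        else:
            l = lam(x)
        # Emit y[0] from state x and move to state x + y[0].
        emit = l * delta(y[0], x) * p_e(y[1:], x + y[0]) if l > 0 else 0.0
        # Drop the oldest symbol of the state without emitting anything.
        skip = (1 - l) * p_e(y, x[1:]) if l < 1 else 0.0
        return emit + skip
    return p_e

if __name__ == "__main__":
    def lam(ctx):
        return 0.5

    def delta(sym, ctx):
        if ctx == "":
            return 0.5
        return 0.9 if sym == ctx[-1] else 0.1

    p_e = make_nonemitting(2, lam, delta)
    print(p_e("0110", ""))
```

Unlike the interpolated recursion, the emitting branch continues from the state `x + y[0]` that was actually reached, so the choice of emitting order for one symbol affects the states available for later symbols.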
Lemma 3.1 states that there exists a non-emitting
model $\phi$ that cannot be converted into an equivalent
basic model of any order. There will always be a
string $x^T$ that distinguishes the non-emitting model
$\phi$ from any given basic model $\phi'$ because the
non-emitting model can encode unbounded dependencies
in its state distribution.
Lemma 3.1
$$\exists \phi \; \forall \phi' \; \exists x^T \in A^* \; [\, p_e(x^T \mid \phi, T) \neq p_m(x^T \mid \phi', T) \,]$$
Proof. The idea of the proof is that our
non-emitting model will encode the first symbol $x_1$ of
the string $x^T$ in its state distribution, for an
unbounded distance. This will allow it to predict the
last symbol $x_T$ using its knowledge of the first
symbol $x_1$. The basic model will only be able to predict the
last symbol $x_T$ using the preceding $n$ symbols, and
therefore when $T$ is greater than $n$, we can arrange
for $p_e(x^T \mid \phi, T)$ to differ from any $p_m(x^T \mid \phi', T)$,
simply by our choice of $x_1$.
The smallest non-emitting model capable of
exhibiting the required behavior has order 2. The
non-emitting transition probabilities $\lambda$ and the
interior of the string $x_2^{T-1}$ will be chosen so that the
non-emitting model is either in an order 2 state or
an order 0 state, with no way to transition from one
to the other. The first symbol $x_1$ will determine
whether the non-emitting model goes to the order 2
state or stays in the order 0 state. No matter what
probability the basic model assigns to the final
symbol $x_T$, the non-emitting model can assign a different
probability by the appropriate choice of $x_1$, $\delta_0(x_T)$,
and $\delta_2(x_T \mid x_{T-2}^{T-1})$.
Consider the second order non-emitting model
over a binary alphabet with $\lambda(0) = 1$, $\lambda(1) = 0$, and
$\lambda(11) = 1$ on strings in $A 1^* A$. When $x_1 = 0$, then $x_2$
will be predicted using the 1st order model $\delta_1(x_2 \mid x_1)$,
and all subsequent $x_t$ will be predicted by the second
order model $\delta_2(x_t \mid x_{t-2}^{t-1})$. When $x_1 = 1$, then all
subsequent $x_t$ will be predicted by the 0th order model
$\delta_0(x_t)$. Thus for all $t > p$, $p_e(x_t \mid x^{t-1}) \neq p_e(x_t \mid x_{t-p}^{t-1})$
for any fixed $p$, and no basic model is equivalent to
this simple non-emitting model. $\Box$
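
The construction in the proof can be checked numerically with a short Python sketch. The emission probabilities in `delta` are hypothetical values chosen only so that the order 0 and order 2 predictions differ, and the entry `"01": 1.0` in `LAM` is an added assumption: the proof does not list it, but the order 2 branch needs the state 01 to emit. Because every non-emitting probability here is 0 or 1, the state sequence is deterministic and can be tracked directly.

```python
N = 2  # maximal model order

# Emit-probabilities lambda for the states reachable on strings in A 1* A.
# The "01" entry is an assumption added for this sketch.
LAM = {"": 1.0, "0": 1.0, "1": 0.0, "01": 1.0, "11": 1.0}

def delta(sym, ctx):
    """Hypothetical emission probabilities: order 2 states favor the symbol 1."""
    if len(ctx) == 2:
        return 0.9 if sym == "1" else 0.1
    return 0.5

def emitting_state(state):
    """Follow non-emitting transitions until a state that emits is reached."""
    while len(state) > N or LAM[state] == 0.0:
        state = state[1:]                # drop the oldest symbol, emit nothing
    return state

def last_symbol_prob(x):
    """Probability of the final symbol of x given everything before it."""
    state = ""
    for sym in x[:-1]:
        state = emitting_state(state) + sym
    return delta(x[-1], emitting_state(state))

if __name__ == "__main__":
    for m in (1, 5, 200):
        interior = "1" * m
        print(m, last_symbol_prob("0" + interior + "1"),
              last_symbol_prob("1" + interior + "1"))
```

For every interior length `m`, the two strings end in the same run of 1s yet receive different predictions for the last symbol, so no basic model of order `p <= m` can match both.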
It is obvious that every basic model is also a non- 
emitting model, with the appropriate choice of non- 
emitting transition probabilities. 
Lemma 3.2
$$\forall \phi \; \exists \phi' \; \forall x^T \in A^* \; [\, p_e(x^T \mid \phi', T) = p_m(x^T \mid \phi, T) \,]$$
These lemmas suffice to establish the following 
theorem. 
Theorem 2 The class of non-emitting Markov 
models is strictly more powerful than the class of ba- 
sic Markov models, because it is able to represent a 
larger class of probability distributions on strings. 
Proof. By lemmas 3.1 and 3.2. $\Box$
Since interpolated models and backoff models are 
equivalent to basic Markov models, we have as 
a corollary that non-emitting Markov models are 
strictly more powerful than interpolated models and 
backoff models as well. Note that non-emitting
Markov models are considerably less powerful than
the full class of stochastic finite state automata
(SFSA) because their states are Markovian.
Non-emitting models are also less powerful than the full
class of hidden Markov models.
Algorithms to evaluate the probability of a string 
according to a non-emitting model, and to opti- 
mize the non-emitting state transitions on a train- 
ing corpus are provided in related work (Ristad and 
Thomas, 1997). 
4 Empirical Results 
The ultimate measure of a statistical model is its 
predictive performance in the domain of interest. 
To take the true measure of non-emitting models 
for natural language texts, we evaluate their per- 
formance as character models on the Brown corpus 
(Francis and Kucera, 1982) and as word models on 
the Wall Street Journal. Our results show that the 
non-emitting Markov model consistently gives better
predictions than the traditional interpolated Markov 
model under equivalent experimental conditions. In
all cases we compare non-emitting and interpolated 
models of identical model orders, with the same 
number of parameters. Note that the non-emitting 
bigram and the interpolated bigram are equivalent. 
Corpus             Size   Alphabet   Blocks
Brown         6,004,032         90       21
WSJ 1989      6,219,350     20,293       22
WSJ 1987-89  42,373,513     20,092      152
All $\lambda$ values were initialized uniformly to 0.5 and
then optimized using deleted estimation on the first 
90% of each corpus (Jelinek and Mercer, 1980). 
DELETED-ESTIMATION($B$, $\phi$)
    Until convergence
        Initialize $\lambda^+$, $\lambda^-$ to zero;
        For each block $B_i$ in $B$
            Initialize $\delta$ using $B - B_i$;
            EXPECTATION-STEP($B_i$, $\phi$, $\lambda^+$, $\lambda^-$);
        MAXIMIZATION-STEP($\phi$, $\lambda^+$, $\lambda^-$);
    Initialize $\delta$ using $B$;
Here $\lambda^+(x^i)$ accumulates the expectations of
emitting a symbol from state $x^i$ while $\lambda^-(x^i)$
accumulates the expectations of transitioning to the state
$x_2^i$ without emitting a symbol.
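
The deleted estimation loop above can be sketched for the simplest case, an interpolated bigram with one parameter per single-symbol state. This is only an illustration under stated assumptions: the expectation and maximization steps shown here are the standard posterior-weight updates for a two-component mixture, not the authors' algorithm (which is given in the related work cited earlier), a fixed iteration count stands in for the convergence test, and the names `bigram_counts` and `deleted_estimation` are hypothetical.

```python
from collections import Counter

def bigram_counts(blocks):
    """Unigram, bigram, and context counts over a list of text blocks."""
    uni, bi, ctx = Counter(), Counter(), Counter()
    for text in blocks:
        uni.update(text)
        for a, b in zip(text, text[1:]):
            bi[a, b] += 1
            ctx[a] += 1
    return uni, bi, ctx

def deleted_estimation(blocks, iterations=10):
    """Estimate lambda(a) for each single-symbol state a."""
    alphabet = sorted(set("".join(blocks)))
    lam = {a: 0.5 for a in alphabet}             # uniform initialization
    for _ in range(iterations):                  # stands in for "until convergence"
        plus, minus = Counter(), Counter()       # lambda+ and lambda- accumulators
        for i, held_out in enumerate(blocks):
            # Transition probabilities come from every block except the held-out one.
            uni, bi, ctx = bigram_counts(blocks[:i] + blocks[i + 1:])
            total = sum(uni.values())
            for a, b in zip(held_out, held_out[1:]):
                hi = lam[a] * (bi[a, b] / ctx[a] if ctx[a] else 0.0)
                lo = (1 - lam[a]) * (uni[b] / total if total else 0.0)
                if hi + lo > 0:                  # expectation step
                    plus[a] += hi / (hi + lo)
                    minus[a] += lo / (hi + lo)
        for a in alphabet:                       # maximization step
            if plus[a] + minus[a] > 0:
                lam[a] = plus[a] / (plus[a] + minus[a])
    return lam

if __name__ == "__main__":
    print(deleted_estimation(["abababab", "abababab", "abcabcab"], iterations=5))
```

The final step of the pseudocode, re-estimating the transition probabilities on all of `B`, would correspond to one more call to `bigram_counts(blocks)` after the loop.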
The remaining 10% of each corpus was
used to evaluate model performance. No parameter
tying was performed.¹
4.1 Brown Corpus 
Our first set of experiments was with character
models on the Brown corpus. The Brown corpus
is an eclectic collection of English prose,
containing 6,004,032 characters partitioned into 500
files. Deleted estimation used 21 blocks.
Results are reported as per-character test message
entropies (bits/char), $-\frac{1}{v} \log_2 p(y^v \mid v)$. The
non-emitting model outperforms the interpolated model
for all nontrivial model orders, particularly for larger
model orders. The non-emitting model is
considerably less prone to overtraining. After 10 EM
iterations, the order 9 non-emitting model scores 2.0085
bits/char while the order 9 interpolated model scores
2.3338 bits/char.
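
For concreteness, the per-symbol entropy used here and the per-word perplexity used for the Wall Street Journal experiments can be computed from a model's per-symbol predictions as in the sketch below. It assumes that $v$ is the length of the test message and that the message probability factors into the listed conditional probabilities. The function names are hypothetical.

```python
import math

def entropy_bits(symbol_probs):
    """Average negative log2 probability per symbol of a test message."""
    v = len(symbol_probs)
    return -sum(math.log2(p) for p in symbol_probs) / v

def perplexity(symbol_probs):
    """Message probability raised to the power -1/v, i.e. 2 ** entropy_bits."""
    return 2.0 ** entropy_bits(symbol_probs)

if __name__ == "__main__":
    probs = [0.25] * 8                   # a model that gives each symbol probability 1/4
    print(entropy_bits(probs), perplexity(probs))
```

The two measures carry the same information, since perplexity is 2 raised to the entropy, so lower is better for both.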
[Figure 1 plot, titled "Brown Corpus": test message entropy against Model Order, with four curves: Non-emitting Model at the best EM iteration, Interpolated Model at the best EM iteration, Non-emitting Model at the 10th EM iteration, and Interpolated Model at the 10th EM iteration.]
Figure 1: Test message entropies as a function of 
model order on the Brown corpus. 
4.2 WSJ 1989 
The second set of experiments was on the 1989
Wall Street Journal corpus, which contains 6,219,350
words. Our vocabulary consisted of the 20,293
words that occurred at least 10 times in the
entire WSJ 1989 corpus. All out-of-vocabulary words
were mapped to a unique OOV symbol. Deleted
estimation used 22 blocks. Following standard
practice in the speech recognition community, results
are reported as per-word test message perplexities,
$p(y^v \mid v)^{-1/v}$. Again, the non-emitting model
outperforms the interpolated Markov model for all
nontrivial model orders.

¹ In forthcoming work, we compare the performance of
the interpolated and non-emitting models on the Brown
corpus and Wall Street Journal with ten different
parameter tying schemes. Our experiments confirm that
some parameter tying schemes improve model
performance, although only slightly. The non-emitting model
consistently outperformed the interpolated model on all
the corpora for all the parameter tying schemes that we
evaluated.
[Figure 2 plot, titled "WSJ 1989": test message perplexity against Model Order, with two curves: Non-emitting Model at the best EM iteration and Interpolated Model at the best EM iteration.]
Figure 2: Test message perplexities as a function of 
model order on WSJ 1989. 
4.3 WSJ 1987-89 
The third set of experiments was on the 1987-89 Wall 
Street Journal corpus, which contains 42,373,513 
words. Our vocabulary consisted of the 20,092 words 
that occurred at least 63 times in the entire WSJ 
1987-89 corpus. Again, all out-of-vocabulary words 
were mapped to a unique OOV symbol. Deleted es- 
timation used 152 blocks. Results are reported as 
test message perplexities. As with the WSJ 1989
corpus, the non-emitting model outperforms the in- 
terpolated model for all nontrivial model orders. 
5 Conclusion 
The power of the non-emitting model comes from 
its ability to represent additional information in its 
state distribution. In the proof of lemma 3.1 above, 
we used the state distribution to represent a long dis- 
tance dependency. We conjecture, however, that the 
empirical success of the non-emitting model is due 
to its ability to remember to ignore (i.e., to forget) a
misleading history at a point of apparent indepen- 
dence. 

[Figure 3 plot: test message perplexity against model order, with two curves: Non-emitting Model at the best EM iteration and Interpolated Model at the best EM iteration.]
Figure 3: Test message perplexities as a function of
model order on WSJ 1987-89.

A point of apparent independence occurs when
we have adequate statistics for two strings $x^{n-1}$ and
$y^n$ but not yet for their concatenation $x^{n-1} y^n$. In
the most extreme case, the frequencies of $x^{n-1}$ and
$y^n$ are high, but the frequency of even the medial
bigram $x_{n-1} y_1$ is low. In such a situation, we would
like to ignore the entire history $x^{n-1}$ when predicting
$y^n$, because all $\delta(y_j \mid x_i^{n-1} y^{j-1})$ will be close to zero
for $i < n$. To simplify the example, we assume that
$\delta(y_j \mid x_i^{n-1} y^{j-1}) = 0$ for $j \geq 1$ and $i < n$.
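In such a situation, the interpolated model must
repeatedly transition past some suffix of the history
$x^{n-1}$ for each of the next $n-1$ predictions, and so the
total probability assigned to $p_c(y^n \mid \epsilon)$ by the
interpolated model is a product of $n(n-1)/2$ probabilities.
$$
\begin{aligned}
p_c(y^n \mid x^{n-1})
&= \Big[\prod_{i=1}^{n-1} (1 - \lambda(x_i^{n-1}))\Big]\, p_c(y_1 \mid \epsilon) \\
&\quad \Big[\prod_{i=2}^{n-1} (1 - \lambda(x_i^{n-1} y_1))\Big]\, p_c(y_2 \mid y_1) \\
&\quad \cdots \\
&\quad (1 - \lambda(x_{n-1} y^{n-2}))\, p_c(y_{n-1} \mid y^{n-2})\; p_c(y_n \mid y^{n-1}) \\
&= \Big[\prod_{k=1}^{n-1} \prod_{i=k}^{n-1} (1 - \lambda(x_i^{n-1} y^{k-1}))\Big]\, p_c(y^n \mid \epsilon)
\end{aligned}
\qquad (4)
$$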
In contrast, the non-emitting model will
immediately transition to the empty context in order to
predict the first symbol $y_1$, and then it need never
again transition past any suffix of $x^{n-1}$.
Consequently, the total probability assigned to $p_e(y^n \mid \epsilon)$
by the non-emitting model is a product of only $n-1$
probabilities.
$$p_e(y^n \mid x^{n-1}) = \Big[\prod_{i=1}^{n-1} (1 - \lambda(x_i^{n-1}))\Big]\, p_e(y^n \mid \epsilon) \qquad (5)$$
Given the same state transition probabilities, note 
that (4) must be considerably less than (5) because 
probabilities lie in [0, 1]. Thus, we believe that the
empirical success of the non-emitting model comes 
from its ability to effectively ignore a misleading his- 
tory rather than from its ability to remember distant 
events. 
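
A small worked example makes the gap between the two products concrete. The sketch below assumes, purely for illustration, that every factor $1 - \lambda(\cdot)$ takes the same value `escape`. In the paper these factors are state-dependent, so the numbers only indicate the order of magnitude. The function names are hypothetical.

```python
def interpolated_penalty(n, escape):
    """Weight on the empty-context prediction when n(n-1)/2 factors are paid."""
    return escape ** (n * (n - 1) // 2)

def nonemitting_penalty(n, escape):
    """Weight on the empty-context prediction when only n-1 factors are paid."""
    return escape ** (n - 1)

if __name__ == "__main__":
    for n in (2, 3, 5, 8):
        print(n, interpolated_penalty(n, 0.5), nonemitting_penalty(n, 0.5))
```

With `escape = 0.5` the two weights coincide at `n = 2`, and the interpolated weight shrinks quadratically in the exponent as `n` grows while the non-emitting weight shrinks only linearly.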
Finally, we note that the use of hierarchical
non-emitting transitions is a general technique that may
be employed in any time series model, including con- 
text models and backoff models. 
Acknowledgments 
Both authors are partially supported by Young 
Investigator Award IRI-0258517 to Eric Ristad from 
the National Science Foundation. 

References 
Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, 
Robert L. Mercer, and David Nahamoo. 1991. A 
fast algorithm for deleted interpolation. In Proc. 
EUROSPEECH '91, pages 1209-1212, Genoa. 
J.G. Cleary and I.H. Witten. 1984. Data com- 
pression using adaptive coding and partial string 
matching. IEEE Trans. Comm., COM-32(4):396- 
402. 
W. Nelson Francis and Henry Kucera. 1982. Fre- 
quency analysis of English usage: lexicon and 
grammar. Houghton Mifflin, Boston. 
Fred Jelinek and Robert L. Mercer. 1980. Inter- 
polated estimation of Markov source parameters 
from sparse data. In Edzard S. Gelsema and 
Laveen N. Kanal, editors, Pattern Recognition in 
Practice, pages 381-397, Amsterdam, May 21-23. 
North Holland. 
Slava Katz. 1987. Estimation of probabilities from 
sparse data for the language model component of 
a speech recognizer. IEEE Trans. ASSP, 35:400- 
401. 
David J.C. MacKay and Linda C. Bauman Peto. 
1994. A hierarchical Dirichlet language model. 
Natural Language Engineering, 1(1). 
Jorma Rissanen. 1983. A universal data compres- 
sion system. IEEE Trans. Information Theory, 
IT-29(5):656-664. 
Jorma Rissanen. 1986. Complexity of strings in the 
class of Markov sources. IEEE Trans. Information 
Theory, IT-32(4):526-532. 
Eric Sven Ristad and Robert G. Thomas. 1997. Hi- 
erarchical non-emitting Markov models. Techni- 
cal Report CS-TR-544-96, Department of Com- 
puter Science, Princeton University, Princeton, 
NJ, March. 
Frans M. J. Willems, Yuri M. Shtarkov, and 
Tjalling J. Tjalkens. 1995. The context-tree 
weighting method: basic properties. IEEE Trans. 
Inf. Theory, 41(3):653-664. 
