Research Group for
Quantitative Linguistics
Fack
Stockholm 40
SWEDEN
KVAL PM 339
June 19, 1967
The Entropy of Recursive Markov Processes 
By 
BENNY BRODDA 
The work reported in this paper has been sponsored by Humanistiska
forskningsrådet, Tekniska forskningsrådet and Riksbankens Jubileumsfond,
Stockholm, Sweden.
THE ENTROPY OF RECURSIVE MARKOV PROCESSES
By 
BENNY BRODDA 
KVAL, Fack, Stockholm 40, Sweden 
Summary 
The aim of this communication is to obtain an explicit formula for calculating
the entropy of a source which behaves in accordance with the rules of an
arbitrary Phrase Structure Grammar, in which relative probabilities are
attached to the rules of the grammar. With this aim in mind we introduce an
alternative definition of the concept of a PSG as a set of self-embedded
(recursive) Finite State Grammars; when the probabilities are taken into
account in such a grammar we call it a Recursive Markov Process.
1. In the first section we give a more detailed definition of the kind of
Markov Processes we are going to generalize later on (in sec. 3), and we also
outline the concept of entropy in an ordinary Markov source. More details on
information theory may be found, e.g., in Khinchin's "Mathematical Foundations
of Information Theory", N.Y., 1957, or "Information Theory" by R. Ash, N.Y.,
1965.
A Markov Grammar is defined as a Markov Source with the following properties:
Assume that there are n+1 states, say S_0, S_1, ..., S_n, in the source. S_0
is defined as the initial state, S_n is defined as the final state, and the
other states are called intermediate states. We shall, of course, also have a
transition matrix, M = (p_ij), containing the transition probabilities of the
source.
a) A transition from state S_i to state S_k is always accompanied by the
production of a (non-zero) letter a_ik from a given finite alphabet.
Transitions to different states from one given state always produce different
letters.
b) From the initial state, S_0, direct or indirect transitions should be
possible to any other state in the source. From no state is a transition to
S_0 allowed.
c) From any state, direct or indirect transitions to the final state S_n
should be possible. From S_n no transition is allowed to any other state
(S_n is an "absorbing state").
A (grammatical) sentence should now be defined as the (left-to-right)
concatenation of the letters produced by the source when passing from the
initial state to the final state.
The length of a sentence is defined as the number of letters in the sentence. 
To simplify matters without losing much generality we also require that

d) The greatest common divisor of all possible sentence lengths is = 1
(i.e., the source becomes an aperiodic source if it is short-circuited by
identifying the final and initial states).
With the properties a) - d) above, the source obtained by identifying the
final and initial states is an indecomposable, ergodic Markov process (cf.
Feller, "Probability Theory and Its Applications", ch. 15, N.Y., 1950).
In the transition matrix M for a Markov grammar of our type all elements in
the first column are zero, and in the last row all elements are zero except
the last one, which is = 1. For a given Markov grammar we define the
uncertainty or entropy, H_i, for each state S_i, i = 0, 1, ..., n, as:
    $H_i = -\sum_{j=0}^{n} p_{ij} \log p_{ij}$,    i = 0, 1, ..., n.
We also define the entropy, H or H(M), for the grammar as 
(1)    $H = \sum_{i=0}^{n-1} x_i H_i$
where x = (x_0, x_1, ..., x_{n-1}) is defined as the stationary distribution
of the source obtained when S_0 and S_n are identified; thus x is defined as
the (unique) solution to the set of simultaneous equations

(2)    $x M_1 = x$,    $x_0 + x_1 + \cdots + x_{n-1} = 1$
where M_1 is formed from M by moving the last column onto the first (all of
whose elements are zero) and then omitting the last row and column. The mean
sentence length, n̄, of the set of grammatical sentences can now easily be
calculated as

(3)    $\bar{n} = 1/x_0$

(cf. Feller, op. cit.).
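As a numerical illustration of formulas (1)-(3), the following sketch in
modern notation computes the per-state entropies, the stationary distribution
of the short-circuited source, the entropy H(M) and the mean sentence length.
The three-state grammar and its probabilities are our own invention, not an
example from the paper.

    import numpy as np

    # Transition matrix M = (p_ij) of a hypothetical Markov grammar with
    # states S_0 (initial), S_1, S_2 (final): first column zero, last row
    # zero except p_22 = 1, as required above.
    M = np.array([[0.0, 1.0, 0.0],    # S_0 -> S_1, producing letter a_01
                  [0.0, 0.4, 0.6],    # S_1 loops or exits to S_2
                  [0.0, 0.0, 1.0]])   # S_2 is absorbing
    n = M.shape[0] - 1

    # Per-state entropies H_i = -sum_j p_ij log p_ij (0 log 0 taken as 0).
    logM = np.log2(M, out=np.zeros_like(M), where=(M > 0))
    H = -(M * logM).sum(axis=1)

    # Short-circuit the source: move the last column onto the first and
    # drop the last row and column; this is the matrix M_1 of formula (2).
    M1 = M[:n, :n].copy()
    M1[:, 0] += M[:n, n]

    # Stationary distribution x: solve x M_1 = x with sum(x) = 1.
    A = np.vstack([M1.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    x = np.linalg.lstsq(A, b, rcond=None)[0]

    print(x @ H[:n])   # formula (1): H(M), about 0.61 bits per letter
    print(1.0 / x[0])  # formula (3): mean sentence length, = 8/3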
2. Embedded Grammars 
We now assume that we have two Markov grammars, M and M_1, with states
S_0, S_1, ..., S_n, and T_0, T_1, ..., T_m, respectively, where S_0 and S_n,
T_0 and T_m are the corresponding initial and final states. Now consider two
states S_i and S_k in the grammar M, and assume that the corresponding
transition probability is = p_ik. We now transform the grammar M into a new
one, M', by embedding the grammar M_1 in M between the states S_i and S_k, an
operation which is performed by identifying the states T_0 and T_m with the
states S_i and S_k respectively. Or, to be more precise, assume that in the
grammar M_1 the transitions from T_0 to the states T_j, j >= 1, have the
probabilities q_0j. Then, in the grammar M', transitions to a state T_j from
the state S_i will take place with the probability = p_ik q_0j. A return to
the state S_k in the "main" grammar from an intermediate state T_j in M_1
takes place with the probability q_jm.
With the conditions above fulfilled, we propose that the entropy for the
composed grammar be calculated according to the formula:

(4)    $H(M') = \dfrac{H(M) + x_i p_{ik}\,\bar{n}_1 H(M_1)}{1 + x_i p_{ik}(\bar{n}_1 - 1)}$
where H(M) is the entropy of the grammar M when there is an ordinary
connection (with probability p_ik) between the states S_i and S_k, and where
x_i is the inherent probability of being in the state S_i under the same
conditions. n̄_1 is the mean sentence length of the sentences produced by the
grammar M_1 alone. (It is quite natural that this number appears as a weight
in the formula, since if one is producing a sentence according to the grammar
M and arrives at the state S_i and from there "dives" into the grammar M_1,
then n̄_1 is the expected waiting time for emerging again in the main grammar
M.) The factor x_i p_ik may be interpreted as the combined probability of
ever arriving at S_i and there choosing the path over to M_1 (you may, of
course, choose quite another path from S_i).
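In code, formula (4) is a one-line computation once the quantities on its
right-hand side are known; the sketch below merely evaluates it, with
invented numbers (H(M) and x_i as in our earlier sketch, H(M_1) and the mean
length n̄_1 assumed computed from (1) and (3) for M_1 alone).

    def composed_entropy(H_M, H_M1, x_i, p_ik, n1_bar):
        # Formula (4): entropy per letter of the composed grammar M'.
        return ((H_M + x_i * p_ik * n1_bar * H_M1)
                / (1.0 + x_i * p_ik * (n1_bar - 1.0)))

    # Hypothetical values for illustration only.
    print(composed_entropy(0.61, 0.97, 0.375, 0.6, 2.67))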
The proof of formula (4) is very straightforward once the premises above have
been given, and we omit it here, as it does not give much extra insight into
the theory. The formula may be extended to the case where more than one
sub-grammar is embedded in the grammar M', by adding terms similar to the one
standing to the right in the numerator and the denominator. The important
thing here is that the factors of the type x_i p_ik depend only on the
probability matrix for the grammar M and are independent of the sub-grammars
involved.
3. Recursive or Self-embedded Sources
It is now quite natural to allow a grammar to have itself as a sub-grammar,
or to allow a grammar M_1 to contain a grammar M_2 which, in its turn,
contains M_1, and so on. The grammars thus obtained cannot, however, be
rewritten as ordinary Markov grammars. The relation between an ordinary
Markov grammar and a recursive one is exactly parallel to the relation
between Finite State Languages and Phrase Structure Languages.
To be more precise, assume that we have a set of Markov grammars M'_0,
M'_1, ..., M'_N, where M'_0 is called the main grammar, in the sense that the
process always starts at the initial state in M'_0 and ceases when it reaches
the final state in M'_0. Each of the grammars may contain any number of the
others (and itself) as sub-grammars. The only restriction is that from any
state in any one of the grammars there should exist a path which ends up at
the final state of M'_0.
Remark 
If we interpret a source of our kind as a Phrase Structure Language, the re- 
writing rules are all of the following kind: 
(5)    $S_i \to A_{ik} + S_k$    or    $S_n \to \#$

where the S's are all non-terminal symbols. (They stand for the names of the
states in the sources M'_0, M'_1, ..., M'_N, where S_0 is assumed to be the
initial symbol /the Chomskyan S/ and S_n is the terminating state, which
produces the sentence delimiter #. The symbols A_ik are either terminal
symbols /letters from a finite alphabet/ or non-terminal symbols equal to the
name of the initial state in one of the grammars M'_0, M'_1, ..., M'_N /one
may also say that A_ik stands as an abbreviation for an arbitrary sentence of
that grammar/.)
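As a purely illustrative instance (the letters, states and probabilities
below are our own invention, not an example from the paper), a grammar might
be written with rules of type (5), relative probabilities in brackets, as:

    S_0 -> a + S_1        [1.0]
    S_1 -> b + S_3        [0.7]
    S_1 -> T_0 + S_2      [0.3]
    S_2 -> c + S_3        [1.0]
    S_3 -> #              [1.0]

Here a, b, c are terminal letters, while A_12 = T_0 is the name of the
initial state of a sub-grammar M'_1: the third rule abbreviates "produce an
arbitrary sentence of M'_1, then continue at S_2".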
We associate with each grammar M'_j the grammar M_j, j = 0, 1, ..., N, by
just considering it as a non-recursive one, that is, we consider all the
symbols A_ik as terminal symbols (even if they are not). The grammars thus
obtained are ordinary Markov grammars according to our definition, and the
entropies H_j = H(M_j) are easily computed according to formula (1), as are
the stationary distributions /formula (2)/. The following theorem shows how
the entropies H'_j for the fully recursive grammars M'_j are connected with
the numbers H_j.
Theorem

The entropies H'_j for a set of recursive Markov grammars M'_j,
j = 0, 1, ..., N, can be calculated according to the formula

(6)    $H'_j = \dfrac{H_j + \sum_k y_{jk}\,\bar{n}_k H'_k}{1 + \sum_k y_{jk}(\bar{n}_k - 1)}$,    j = 0, 1, ..., N.

Here the factors y_jk depend only on the probability matrices of the
grammars, and the numbers n̄_k, defined as the mean sentence lengths of the
sentences of the grammars M'_k, k = 0, 1, ..., N, are computable according to
the lemma below. H_j is the entropy of the associated non-recursive grammar
M_j.
The theorem above is a direct application of formula (4) of sec. 2 to the
grammars involved.
The coefficients y_jk in formula (6) can, more precisely, be calculated as a
sum of terms of the type x_i p_im, with the indices (i, m) being those where
the grammar M'_k appears in the grammar M'_j; x_i and p_im are the components
of the stationary distribution and the probability matrix for the grammar
M_j.
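A hedged sketch of this bookkeeping (the function and its occurrence map are
our own illustration, not the paper's notation): y_jk is accumulated as the
sum of x_i p_im over the transitions (i, m) of M_j realized via M'_k.

    def y_coefficients(x, P, occurrences, N):
        # occurrences: dict mapping k to the list of transitions (i, m)
        # of grammar M_j that pass through the sub-grammar M'_k.
        y = [0.0] * (N + 1)
        for k, pairs in occurrences.items():
            y[k] = sum(x[i] * P[i][m] for (i, m) in pairs)
        return y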
Assume now that we have a Markov grammar of our type, but for which each
transition takes a certain amount of time. A very natural question is then:
"What is the expected time to produce a sentence in that language?" The
answer is given by the following lemma.
Lemma

Let M be a Markov grammar with states S_i, i = 0, 1, ..., n, where S_0 and
S_n are the initial and final states respectively.

Assume that each transition S_i -> S_k takes t_ik time units.

Denote the expected time for arrival at S_n, given that the grammar is in
state S_i, by t_i, i = 0, 1, ..., n (thus t_0 is the expected time for
producing a sentence). The times t_i will then fulfill the following set of
simultaneous linear equations:

(7)    $t_i = \sum_k p_{ik} (t_{ik} + t_k)$
Formula (7) is itself a proof of the lemma: each equation simply conditions
on the first transition taken from S_i.
With more convenient notation we can write (7) as

    $(E - P)\,t = P(t)$

where E is the unit matrix, P is the probability matrix (with p_nn = 0) and
P(t) is the vector with components

    $P_i(t) = \sum_m p_{im} t_{im}$,    i = 0, 1, ..., n.
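A sketch of the lemma in matrix form, reusing the hypothetical grammar of our
first sketch: with p_nn set to 0, the system (E - P) t = P(t) can be solved
directly, and t_n = 0 comes out automatically.

    import numpy as np

    P = np.array([[0.0, 1.0, 0.0],     # the grammar of the earlier sketch,
                  [0.0, 0.4, 0.6],     # now with p_nn = 0 as stipulated
                  [0.0, 0.0, 0.0]])
    T = np.ones_like(P)                # t_ik = 1 for every transition, so
                                       # t_0 reduces to the mean length (3)

    rhs = (P * T).sum(axis=1)          # the vector P(t)
    t = np.linalg.solve(np.eye(3) - P, rhs)
    print(t)                           # [8/3, 5/3, 0]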
The application of the lemma for computing the numbers n̄_k in formula (6) is
now the following. The transition times of the lemma are, of course, the
expected times (or "lengths", as we have called them earlier) for passing via
a sub-grammar of the grammar under consideration. Thus the numbers t_ik are
themselves the unknown entities n̄_k.
For each of the sub-grammars M'_j, j = 0, 1, ..., N, we get a set of linear
equations of type (7) for determining the vectors t of the lemma. The first
component of this vector, i.e. the number t_0, is then equal to the expected
length, n̄, of the sentences of that grammar. (Unfortunately, we also have to
compute, in addition, the expected times for going from any state of the
sub-grammars to the corresponding final state.)
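The coupled computation can be sketched as follows, again with invented
grammars: within each grammar, the time t_ik of a transition that dives into
sub-grammar g is the still-unknown mean length n̄_g, so the systems (7) are
solved together, here by a simple fixed-point iteration (our choice of
method; the paper only asserts solvability).

    import numpy as np

    # Two hypothetical mutually recursive grammars.  sub[j] maps a
    # transition (i, k) of grammar j to the index of the sub-grammar
    # traversed on that step; all other transitions emit one letter.
    P = [np.array([[0.0, 0.7, 0.3],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]]),
         np.array([[0.0, 0.6, 0.4],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])]
    sub = [{(0, 2): 1},    # grammar 0 dives into grammar 1 on S_0 -> S_2
           {(0, 2): 0}]    # grammar 1 recursively contains grammar 0

    nbar = np.ones(len(P))             # initial guess for the mean lengths
    for _ in range(100):               # fixed-point iteration on (7)
        new = []
        for j, Pj in enumerate(P):
            T = np.ones_like(Pj)       # default time: one letter
            for (i, k), g in sub[j].items():
                T[i, k] = nbar[g]      # time spent inside sub-grammar g
            t = np.linalg.solve(np.eye(len(Pj)) - Pj, (Pj * T).sum(axis=1))
            new.append(t[0])           # expected length of a full sentence
        nbar = np.array(new)
    print(nbar)                        # converges to [2.0, 2.0]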
The total number of unknowns involved when computing the entropy of our
grammar (i.e., the entropy H'_0) is equal to

(the total number of states in all our sub-grammars) plus
(the number of sub-grammars).

This is also the number of equations, for we have N + 1 equations from
formula (6) and then N + 1 sets of equations of the type (7). We assert that
all these simultaneous equations are solvable if the grammar fulfills the
conditions stated earlier, i.e., that from each state in any sub-grammar
there exists at least one path to the final state of that grammar.
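For completeness, a hedged sketch of the last step: once the numbers H_j,
y_jk and n̄_k are in hand, the N + 1 equations (6) are linear in the unknowns
H'_j and can be solved directly (the coefficients below are invented).

    import numpy as np

    Hj   = np.array([0.61, 0.97])      # entropies of the associated M_j
    Y    = np.array([[0.0,  0.15],     # y_jk: where M'_k occurs in M'_j
                     [0.05, 0.0 ]])
    nbar = np.array([4.0, 2.7])        # mean lengths from the lemma

    # Rewrite (6) as (diag(d) - Y * nbar) H' = H with
    # d_j = 1 + sum_k y_jk (nbar_k - 1), then solve the linear system.
    d = 1.0 + (Y * (nbar - 1.0)).sum(axis=1)
    A = np.diag(d) - Y * nbar
    H_rec = np.linalg.solve(A, Hj)     # the recursive entropies H'_j
    print(H_rec)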
