Variation of Entropy and Parse Trees of Sentences as a Function of the
Sentence Number
Dmitriy Genzel and Eugene Charniak
Brown Laboratory for Linguistic Information Processing
Department of Computer Science
Brown University
Providence, RI, USA, 02912
{dg,ec}@cs.brown.edu
Abstract
In this paper we explore the variation of
sentences as a function of the sentence
number. We demonstrate that while the
entropy of the sentence increases with the
sentence number, it decreases at the para-
graph boundaries in accordance with the
Entropy Rate Constancy principle (intro-
duced in related work). We also demon-
strate that the principle holds for differ-
ent genres and languages and explore the
role of genre informativeness. We investi-
gate potential causes of entropy variation
by looking at the tree depth, the branch-
ing factor, the size of constituents, and the
occurrence of gapping.
1 Introduction and Related Work
In many natural language processing applications,
such as parsing or language modeling, sentences are
treated as natural self-contained units. Yet it is well-
known that for interpreting the sentences the dis-
course context is often very important. The later
sentences in the discourse contain references to the
entities in the preceding sentences, and this fact is
often useful, e.g., in caching for language model-
ing (Goodman, 2001). The indirect influence of the
context, however, can be observed even when a sen-
tence is taken as a stand-alone unit, i.e., without its
context. It is possible to distinguish between a set
of earlier sentences and a set of later sentences with-
out any direct comparison by computing certain lo-
cal statistics of individual sentences, such as their
entropy (Genzel and Charniak, 2002). In this work
we provide additional evidence for this hypothesis
and investigate other sentence statistics.
1.1 Entropy Rate Constancy
Entropy, as a measure of information, is often used
in the communication theory. If humans have
evolved to communicate in the most efficient way
(some evidence for that is provided by Plotkin and
Nowak (2000)), then they would communicate in
such a way that the entropy rate would be constant,
namely, equal to the channel capacity (Shannon,
1948).
In our previous work (Genzel and Charniak,
2002) we propose that entropy rate is indeed con-
stant in human communications. When read in con-
text, each sentence would appear to contain roughly
the same amount of information, per word, whether
it is the first sentence or the tenth one. Thus the
tenth sentence, when taken out of context, must ap-
pear significantly more informative (and therefore
harder to process), since it implicitly assumes that
the reader already knows all the information in the
preceding nine sentences. Indeed, the greater the
sentence number, the harder to process the sentence
must appear, though for large sentence numbers this
would be very difficult to detect. This makes intu-
itive sense: out-of-context sentences are harder to
understand than in-context ones, and first sentences
can never be out of context. It is also demonstrated
empirically through estimating entropy rate of vari-
ous sentences.
In the first part of the present paper (Sections 2
and 3) we extend and further verify these results. In
the second part (Section 4), we investigate the poten-
tial reasons underlying this variation in complexity
by looking at the parse trees of the sentences. We
also discuss how genre and style affect the strength
of this effect.
1.2 Limitations of Preceding Work
In our previous work we demonstrate that the word
entropy rate increases with the sentence number; we
do it by estimating entropy of Wall Street Journal
articles in Penn Treebank in three different ways. It
may be the case, however, that this effect is corpus-
and language-specific. To show that the Entropy
Rate Constancy Principle is universal, we need to
confirm it for different genres and different lan-
guages. We will address this issue in Section 3.
Furthermore, if the principle is correct, it should
also apply to the sentences numbered from the be-
ginning of a paragraph, rather than from the begin-
ning of the article, since in either case there is a shift
of topic. We will discuss this in Section 2.
2 Within-Paragraph Effects
2.1 Implications of Entropy Rate Constancy
Principle
We have previously demonstrated (see Genzel and
Charniak (2002) for detailed derivation) that the
conditional entropy of the ith word in the sentence
(Xi), given its local context Li (the preceding words
in the same sentence) and global context Ci (the
words in all preceding sentences) can be represented
as
H(Xi|Ci,Li) = H(Xi|Li)−I(Xi,Ci|Li)
where H(Xi|Li) is the conditional entropy of the
ith word given local context, and I(Xi,Ci|Li) is
the conditional mutual information between the ith
word and out-of-sentence context, given the local
context. Since Ci increases with the sentence num-
ber, we will assume that, normally, it will provide
more and more information with each sentence. This
would cause the second term on the right to increase
with the sentence number, and since H(Xi|Ci,Li)
must remain constant (by our assumption), the first
term should increase with sentence number, and it
had been shown to do so (Genzel and Charniak,
2002).
Our assumption about the increase of the mutual
information term is, however, likely to break at the
paragraph boundary. If there is a topic shift at the
boundary, the context probably provides more infor-
mation to the preceding sentence, than it does to the
new one. Hence, the second term will decrease, and
so must the first one.
In the next section we will verify this experimen-
tally.
2.2 Experimental Setup
We use the Wall Street Journal text (years 1987-
1989) as our corpus. We take all articles that con-
tain ten or more sentences, and extract the first ten
sentences. Then we:
1. Group extracted sentences according to their
sentence number into ten sets of 49559 sen-
tences each.
2. Separate each set into two subsets, paragraph-
starting and non-paragraph-starting sentences1.
3. Combine first 45000 sentences from each set
into the training set and keep all remaining data
as 10 testing sets (19 testing subsets).
We use a simple smoothed trigram language model:
P(xi|x1 ...xi−1) ≈ P(xi|xi−2xi−1)
= λ1 ˆP(xi|xi−2xi−1)
+ λ2 ˆP(xi|xi−1)
+ (1−λ1 −λ2) ˆP(xi)
where λ1 and λ2 are the smoothing coefficients2,
and ˆP is a maximum likelihood estimate of the cor-
responding probability, e.g.,
ˆP(xi|xi−2xi−1) = C(xi−2xi−1xi)
C(xi−2xi−1)
where C(xi ...xj) is the number of times this se-
quence appears in the training data.
We then evaluate the resulting model on each of
the testing sets, computing per-word entropy of the
set:
ˆH(X) = 1
|X|
summationdisplay
xi∈X
logP(xi|xi−2xi−1)
1First sentences are, of course, all paragraph-starting.
2We have arbitrarily chosen the smoothing coefficients to be
0.5 and 0.3, correspondingly.
1 2 3 4 5 6 7 8 9 105.8
5.9
6.0
6.1
6.2
6.3
6.4
6.5
6.6
Sentence number
Entropy (bits)
all sentences
paragraph−starting
non−paragraph−starting
Figure 1: Entropy vs. Sentence number
2.3 Results and Discussion
As outlined above, we have ten testing sets, one for
each sentence number; each set (except for the first)
is split into two subsets: sentences that start a para-
graph, and sentences that do not. The results for full
sets, paragraph-starting subsets, and non-paragraph-
starting subsets are presented in Figure 1.
First, we can see that the the entropy for full
sets (solid line) is generally increasing. This re-
sult corresponds to the previously discussed effect
of entropy increasing with the sentence number. We
also see that for all sentence numbers the paragraph-
starting sentences have lower entropy than the non-
paragraph-starting ones, which is what we intended
to demonstrate. In such a way, the paragraph-
starting sentences are similar to the first sentences,
which makes intuitive sense.
All the lines roughly show that entropy increases
with the sentence number, but the behavior at the
second and the third sentences is somewhat strange.
We do not yet have a good explanation of this phe-
nomenon, except to point out that paragraphs that
start at the second or third sentences are probably
not “normal” because they most likely do not indi-
cate a topic shift. Another possible explanation is
that this effect is an artifact of the corpus used.
We have also tried to group sentences based on
their sentence number within paragraph, but were
unable to observe a significant effect. This may be
due to the decrease of this effect in the later sen-
tences of large articles, or perhaps due to the relative
weakness of the effect3.
3 Different Genres and Languages
3.1 Experiments on Fiction
3.1.1 Introduction
All the work on this problem so far has focused
on the Wall Street Journal articles. The results are
thus naturally suspect; perhaps the observed effect
is simply an artifact of the journalistic writing style.
To address this criticism, we need to perform com-
parable experiments on another genre.
Wall Street Journal is a fairly prototypical exam-
ple of a news article, or, more generally, a writing
with a primarily informative purpose. One obvious
counterpart of such a genre is fiction4. Another al-
ternative might be to use transcripts of spoken dia-
logue.
Unfortunately, works of fiction, are either non-
homogeneous (collections of works) or relatively
short with relatively long subdivisions (chapters).
This is crucial, since in the sentence number experi-
ments we obtain one data point per article, therefore
it is impossible to use book chapters in place of arti-
cles.
3.1.2 Experimental Setup and Results
For our experiments we use War and Peace (Tol-
stoy, 1869), since it is rather large and publicly avail-
able. It contains only about 365 rather long chap-
ters5. Unlike WSJ articles, each chapter is not writ-
ten on a single topic, but usually has multiple topic
shifts. These shifts, however, are marked only as
paragraph breaks. We, therefore, have to assume
that each paragraph break represents a topic shift,
3We combine into one set very heterogeneous data: both 1st
and 51st sentence might be in the same set, if they both start
a paragraph. The experiment in Section 2.2 groups only the
paragraph-starting sentences with the same sentence number.
4We use prose rather than poetry, which presumably is
even less informative, because poetry often has superficial con-
straints (meter); also, it is hard to find a large homogeneous
poetry collection.
5For comparison, Penn Treebank contains over 2400 (much
shorter) WSJ articles.
1 2 3 4 58.05
8.1
8.15
8.2
8.25
8.3
Sentence number since beginning of paragraph
Entropy in bits
Real run    Control runs
Figure 2: War and Peace: English
and treat each paragraph as being an equivalent of a
WSJ article, even though this is obviously subopti-
mal.
The experimental setup is very similar to the one
used in Section 2.2. We use roughly half of the data
for training purposes and split the rest into testing
sets, one per each sentence number, counted from
the beginning of a paragraph.
We then evaluate the results using the same
method as in Section 2.2. We expect that the en-
tropy would increase with the sentence number, just
as in the case of the sentences numbered from the
article boundary. This effect is present, but is not
very pronounced. To make sure that it is statistically
significant, we also do 1000 control runs for com-
parison, with paragraph breaks inserted randomly at
the appropriate rate. The results (including 3 ran-
dom runs) can be seen in Figure 2. To make sure
our results are significant we compare the correla-
tion coefficient between entropy and sentence num-
ber to ones from simulated runs, and find them to be
significant (P=0.016).
It is fairly clear that the variation, especially be-
tween the first and the later sentences, is greater
than it would be expected for a purely random oc-
currence. We will see further evidence for this in the
next section.
3.2 Experiments on Other Languages
To further verify that this effect is significant and
universal, it is necessary to do similar experiments
in other languages. Luckily, War and Peace is also
digitally available in other languages, of which we
pick Russian and Spanish for our experiments.
We follow the same experimental procedure as in
Section 3.1.2 and obtain the results for Russian (Fig-
ure 3(a)) and Spanish (Figure 3(b)). We see that re-
sults are very similar to the ones we obtained for
English. The results are again significant for both
Russian (P=0.004) and Spanish (P=0.028).
3.3 Influence of Genre on the Strength of the
Effect
We have established that entropy increases with the
sentence number in the works of fiction. We ob-
serve, however, that the effect is smaller than re-
ported in our previous work (Genzel and Charniak,
2002) for Wall Street Journal articles. This is to be
expected, since business and news writing tends to
be more structured and informative in nature, grad-
ually introducing the reader to the topic. Context,
therefore, plays greater role in this style of writing.
To further investigate the influence of genre and
style on the strength of the effect we perform exper-
iments on data from British National Corpus (Leech,
1992) which is marked by genre.
For each genre, we extract first ten sentences of
each genre subdivision of ten or more sentences.
90% of this data is used as training data and 10%
as testing data. Testing data is separated into ten
sets: all the first sentences, all the second sentences,
and so on. We then use a trigram model trained on
the training data set to find the average per-word en-
tropy for each set. We obtain ten numbers, which
in general tend to increase with the sentence num-
ber. To find the degree to which they increase, we
compute the correlation coefficient between the en-
tropy estimates and the sentence numbers. We report
these coefficients for some genres in Table 1. To en-
sure reliability of results we performed the described
process 400 times for each genre, sampling different
testing sets.
The results are very interesting and strongly sup-
port our assumption that informative and struc-
tured (and perhaps better-written) genres will have
1 2 3 4 59.2
9.3
9.4
9.5
9.6
9.7
9.8
Sentence number since beginning of paragraph
Entropy in bits
Real run    Control runs
(a) Russian
1 2 3 4 58.2
8.25
8.3
8.35
8.4
8.45
8.5
8.55
8.6
8.65
Entropy in bits
Sentence number since beginning of paragraph
Real run    Control runs
(b) Spanish
Figure 3: War and Peace
stronger correlations between entropy and sentence
number. There is only one genre, tabloid newspa-
pers6, that has negative correlation. The four gen-
res with the smallest correlation are all quite non-
informative: tabloids, popular magazines, advertise-
ments7 and poetry. Academic writing has higher
correlation coefficients than non-academic. Also,
humanities and social sciences writing is probably
more structured and better stylistically than science
and engineering writing. At the bottom of the table
we have genres which tend to be produced by pro-
fessional writers (biography), are very informative
(TV news feed) or persuasive and rhetorical (parlia-
mentary proceedings).
3.4 Conclusions
We have demonstrated that paragraph boundaries of-
ten cause the entropy to decrease, which seems to
support the Entropy Rate Constancy principle. The
effects are not very large, perhaps due to the fact
6Perhaps, in this case the readers are only expected to look
at the headlines.
7Advertisements could be called informative, but they tend
to be sets of loosely related sentences describing various fea-
tures, often in no particular order.
that each new paragraph does not necessarily rep-
resent a shift of topic. This is especially true in a
medium like the Wall Street Journal, where articles
are very focused and tend to stay on one topic. In
fiction, paragraphs are often used to mark a topic
shift, but probably only a small proportion of para-
graph breaks in fact represents topic shifts. We also
observed that more informative and structured writ-
ing is subject to stronger effect than speculative and
imaginative one, but the effect is present in almost
all writing.
In the next section we will discuss the potential
causes of the entropy results presented both in the
preceding and this work.
4 Investigating Non-Lexical Causes
In our previous work we discuss potential causes
of the entropy increase. We find that both lexical
(which words are used) and non-lexical (how the
words are used) causes are present. In this section
we will discuss possible non-lexical causes.
We know that some non-lexical causes are
present. The most natural way to find these causes is
to examine the parse trees of the sentences. There-
fore, we collect a number of statistics on the parse
BNC genre Corr. coef.
Tabloid newspapers −0.342±0.014
Popular magazines 0.073±0.016
Print advertisements 0.175±0.015
Fiction: poetry 0.261±0.013
Religious texts 0.328±0.012
Newspapers: commerce/finance 0.365±0.013
Non-acad: natural sciences 0.371±0.012
Official documents 0.391±0.012
Fiction: prose 0.409±0.011
Non-acad: medicine 0.411±0.013
Newspapers: sports 0.433±0.047
Acad: natural sciences 0.445±0.010
Non-acad: tech, engineering 0.478±0.011
Non-acad: politics, law, educ. 0.512±0.004
Acad: medicine 0.517±0.007
Acad: tech, engineering 0.521±0.010
Newspapers: news reportage 0.541±0.009
Non-acad: social sciences 0.541±0.008
Non-acad: humanities 0.598±0.007
Acad: politics, laws, educ. 0.619±0.006
Newspapers: miscellaneous 0.622±0.009
Acad: humanities 0.676±0.007
Commerce/finance, economics 0.678±0.007
Acad: social sciences 0.688±0.004
Parliamentary proceedings 0.774±0.002
TV news script 0.850±0.002
Biographies 0.894±0.001
Table 1: Correlation coefficient for different genres
trees and investigate if any statistics show a signifi-
cant change with the sentence number.
4.1 Experimental Setup
We use the whole Penn Treebank corpus (Marcus et
al., 1993) as our data set. This corpus contains about
50000 parsed sentences.
Many of the statistics we wish to compute are very
sensitive to the length of the sentence. For example,
the depth of the tree is almost linearly related to the
sentence length. This is important because the aver-
age length of the sentence varies with the sentence
number. To make sure we exclude the effect of the
sentence length, we need to normalize for it.
We proceed in the following way. Let T be the set
of trees, and f : T → R be some statistic of a tree.
Let l(t) be the length of the underlying sentence for
0 2 4 6 8 100.985
0.99
0.995
1
1.005
1.01
1.015
Bucket number (for sentence number)
Adjusted tree depth
Figure 4: Tree Depth
tree t. Let L(n) = {t|l(t) = n} be the set of trees of
size n. Let Lf(n) be defined as 1|L(n)|summationtextt∈L(n) f(t),
the average value of the statistic f on all sentences
of length n. We then define the sentence-length-
adjusted statistic, for all t, as
fprime(t) = f(t)L
f(l(t))
The average value of the adjusted statistic is now
equal to 1, and it is independent of the sentence
length.
We can now report the average value of each
statistic for each sentence number, as we have done
before, but instead we will group the sentence num-
bers into a small number of “buckets” of exponen-
tially increasing length8. We do so to capture the
behavior for all the sentence numbers, and not just
for the first ten (as before), as well as to lump to-
gether sentences with similar sentence numbers, for
which we do not expect much variation.
4.2 Tree Depth
The first statistic we consider is also the most nat-
ural: tree depth. The results can be seen in Figure
4.
In the first part of the graph we observe an in-
crease in tree depth, which is consistent with the in-
creasing complexity of the sentences. In the later
8For sentence number n we compute the bucket number as
floorleftlog1.5 nfloorright
0 2 4 6 8 100.96
0.98
1
1.02
1.04
1.06
1.08
1.1
1.12
1.14
Bucket number (for sentence number)
Adjusted branching factor
Branching factor
NPs only        
Base NPs only   
Figure 5: Branching factor
sentences, the depth decreases slightly, but still stays
above the depth of the first few sentences.
4.3 Branching Factor and NP Size
Another statistic we investigate is the average
branching factor, defined as the average number of
children of all non-leaf nodes in the tree. It does
not appear to be directly correlated with the sentence
length, but we normalize it to make sure it is on the
same scale, so we can compare the strength of re-
sulting effect.
Again, we expect lower entropy to correspond to
flatter trees, which corresponds to large branching
factor. Therefore we expect the branching factor to
decrease with the sentence number, which is indeed
what we observe (Figure 5, solid line).
Each non-leaf node contributes to the average
branching factor. It is likely, however, that the
branching factor changes with the sentence num-
ber for certain types of nodes only. The most obvi-
ous contributors for this effect seem to be NP (noun
phrase) nodes. Indeed, one is likely to use several
words to refer to an object for the first time, but only
a few words (even one, e.g., a pronoun) when refer-
ring to it later. We verify this intuitive suggestion,
by computing the branching factor for NP, VP (verb
phrase) and PP (prepositional phrase) nodes. Only
NP nodes show the effect, and it is much stronger
(Figure 5, dashed line) than the effect for the branch-
0 2 4 6 8 100.98
0.99
1
1.01
1.02
1.03
1.04
1.05
Bucket number (for sentence number)
Adjusted branching factor
Branching factor             Branching factor w/o base NPs
Figure 6: Branching Factor without Base NPs
ing factor.
Furthermore, it is natural to expect that most of
this effect arises from base NPs, which are defined
as the NP nodes whose children are all leaf nodes.
Indeed, base NPs show a slightly more pronounced
effect, at least with regard to the first sentence (Fig-
ure 5, dotted line).
4.4 Further Investigations
We need to determine whether we have accounted
for all of the branching factor effect, by proposing
that it is simply due to decrease in the size of the base
NPs. To check, we compute the average branching
factor, excluding base NP nodes.
By comparing the solid line in Figure 6 (the origi-
nal average branching factor result) with the dashed
line (base NPs excluded), you can see that base NPs
account for most, though not all of the effect. It
seems, then, that this problem requires further in-
vestigation.
4.5 Gapping
Another potential reason for the increase in the sen-
tence complexity might be the increase in the use of
gapping. We investigate whether the number of the
ellipsis constructions varies with the sentence num-
ber. We again use Penn Treebank for this experi-
0 2 4 6 8 100.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
1.3
Bucket number (for sentence number)
Adjusted number of gaps
Figure 7: Number of ellipsis nodes
ment9.
As we can see from Figure 7, there is indeed a sig-
nificant increase in the use of ellipsis as the sentence
number increases, which presumably makes the sen-
tences more complex. Only about 1.5% of all the
sentences, however, have gaps.
5 Future Work and Conclusions
We have discovered a number of interesting facts
about the variation of sentences with the sentence
number. It has been previously known that the com-
plexity of the sentences increases with the sentence
number. We have shown here that the complexity
tends to decrease at the paragraph breaks in accor-
dance with the Entropy Rate Constancy principle.
We have verified that entropy also increases with the
sentence number outside of Wall Street Journal do-
main by testing it on a work of fiction. We have also
verified that it holds for languages other than En-
glish. We have found that the strength of the effect
depends on the informativeness of a genre.
We also looked at the various statistics that show
a significant change with the sentence number, such
as the tree depth, the branching factor, the size of
noun phrases, and the occurrence of gapping.
Unfortunately, we have been unable to apply these
results successfully to any practical problem so far,
9Ellipsis nodes in Penn Treebank are marked with *?* .
See Bies et al. (1995) for details.
primarily because the effects are significant on av-
erage and not in any individual instances. Finding
applications of these results is the most important
direction for future research.
Also, since this paper essentially makes state-
ments about human processing, it would be very ap-
propriate to to verify the Entropy Rate Constancy
principle by doing reading time experiments on hu-
man subjects.
6 Acknowledgments
We would like to acknowledge the members of the
Brown Laboratory for Linguistic Information Pro-
cessing and particularly Mark Johnson for many
useful discussions. This research has been supported
in part by NSF grants IIS 0085940, IIS 0112435, and
DGE 9870676.

References
A. Bies, M. Ferguson, K. Katz, and R. MacIntyre, 1995.
Bracketing Guidelines for Treebank II Style Penn Tree-
bank Project. Penn Treebank Project, University of
Pennsylvania.
D. Genzel and E. Charniak. 2002. Entropy rate con-
stancy in text. In Proceedings of ACL–2002, Philadel-
phia.
J. T. Goodman. 2001. A bit of progress in language mod-
eling. Computer Speech and Language, 15:403–434.
G. Leech. 1992. 100 million words of English:
the British National Corpus. Language Research,
28(1):1–13.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz.
1993. Building a large annotated corpus of En-
glish: the Penn treebank. Computational Linguistics,
19:313–330.
J. B. Plotkin and M. A. Nowak. 2000. Language evo-
lution and information theory. Journal of Theoretical
Biology, pages 147–159.
C. E. Shannon. 1948. A mathematical theory of commu-
nication. The Bell System Technical Journal, 27:379–
423, 623–656, July, October.
L. Tolstoy. 1869. War and Peace. Available online,
in 4 languages (Russian, English, Spanish, Italian):
http://www.magister.msk.ru/library/tolstoy/wp/wp00.htm.
