<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1009"> <Title>Variation of Entropy and Parse Trees of Sentences as a Function of the Sentence Number</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Within-Paragraph Effects </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Implications of Entropy Rate Constancy Principle </SectionTitle> <Paragraph position="0"> We have previously demonstrated (see Genzel and Charniak (2002) for a detailed derivation) that the conditional entropy of the ith word in the sentence (Xi), given its local context Li (the preceding words in the same sentence) and global context Ci (the words in all preceding sentences) can be represented as</Paragraph> <Paragraph position="1"> H(Xi|Ci,Li) = H(Xi|Li) - I(Xi,Ci|Li) </Paragraph> <Paragraph position="2"> where H(Xi|Li) is the conditional entropy of the ith word given local context, and I(Xi,Ci|Li) is the conditional mutual information between the ith word and out-of-sentence context, given the local context. Since Ci increases with the sentence number, we will assume that, normally, it will provide more and more information with each sentence. This would cause the second term on the right to increase with the sentence number, and since H(Xi|Ci,Li) must remain constant (by our assumption), the first term should increase with the sentence number, and it has been shown to do so (Genzel and Charniak, 2002).</Paragraph> <Paragraph position="3"> Our assumption about the increase of the mutual information term is, however, likely to break at the paragraph boundary. If there is a topic shift at the boundary, the context probably provides more information to the preceding sentence than it does to the new one. Hence, the second term will decrease, and so must the first one.</Paragraph> <Paragraph position="4"> In the next section we will verify this experimentally.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Experimental Setup </SectionTitle> <Paragraph position="0"> We use the Wall Street Journal text (years 1987-1989) as our corpus. We take all articles that contain ten or more sentences and extract the first ten sentences. Then we: 1. Group the extracted sentences according to their sentence number into ten sets of 49559 sentences each.</Paragraph> <Paragraph position="1"> 2. Separate each set into two subsets, paragraph-starting and non-paragraph-starting sentences1. 3. Combine the first 45000 sentences from each set into the training set and keep all remaining data as 10 testing sets (19 testing subsets).</Paragraph> <Paragraph position="2"> We use a simple smoothed trigram language model:</Paragraph> <Paragraph position="3"> P(xi|xi-2 xi-1) = l1 ^P(xi|xi-2 xi-1) + l2 ^P(xi|xi-1) + (1 - l1 - l2) ^P(xi) </Paragraph> <Paragraph position="4"> where l1 and l2 are the smoothing coefficients2, and ^P is a maximum likelihood estimate of the corresponding probability, e.g.,</Paragraph> <Paragraph position="5"> ^P(xi|xi-2 xi-1) = C(xi-2 xi-1 xi) / C(xi-2 xi-1) </Paragraph> <Paragraph position="6"> where C(xi ...xj) is the number of times this sequence appears in the training data.</Paragraph> <Paragraph position="7"> We then evaluate the resulting model on each of the testing sets, computing the per-word entropy of the set:</Paragraph> <Paragraph position="8"> H(X) = -(1/|X|) * sum over xi in X of log P(xi|xi-2 xi-1) </Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Results and Discussion </SectionTitle> <Paragraph position="0"> As outlined above, we have ten testing sets, one for each sentence number; each set (except for the first) is split into two subsets: sentences that start a paragraph, and sentences that do not.
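To make the setup of Section 2.2 concrete, the following is a minimal sketch of the interpolated trigram model and the per-word entropy computation described above. It is not the authors' original code: the smoothing coefficients (LAMBDA1, LAMBDA2), the add-one floor on the unigram term, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical smoothing coefficients; the paper does not give their values
# in this section, so these are placeholders.
LAMBDA1, LAMBDA2 = 0.6, 0.3

def train_trigram(sentences):
    """Collect n-gram counts from tokenized sentences (lists of words)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for words in sentences:
        padded = ["<s>", "<s>"] + words
        total += len(words)
        for i in range(2, len(padded)):
            uni[padded[i]] += 1
            bi[(padded[i - 1], padded[i])] += 1
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return uni, bi, tri, total

def prob(w2, w1, w, uni, bi, tri, total, vocab_size):
    """Interpolated trigram probability:
    P(w|w2 w1) = l1*P^(w|w2 w1) + l2*P^(w|w1) + (1-l1-l2)*P^(w),
    where P^ are maximum likelihood estimates (add-one on the unigram
    term so unseen words do not receive zero probability)."""
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p1 = (uni[w] + 1) / (total + vocab_size)
    return LAMBDA1 * p3 + LAMBDA2 * p2 + (1 - LAMBDA1 - LAMBDA2) * p1

def per_word_entropy(test_sentences, model):
    """Per-word entropy (in bits) of a test set:
    H = -(1/N) * sum_i log2 P(x_i | x_{i-2}, x_{i-1})."""
    uni, bi, tri, total = model
    vocab_size = len(uni) + 1
    log_sum, n = 0.0, 0
    for words in test_sentences:
        padded = ["<s>", "<s>"] + words
        for i in range(2, len(padded)):
            p = prob(padded[i - 2], padded[i - 1], padded[i],
                     uni, bi, tri, total, vocab_size)
            log_sum += math.log2(p)
            n += 1
    return -log_sum / n

# Usage sketch: one entropy estimate per sentence-number test set.
# model = train_trigram(training_sentences)
# for k, test_set in enumerate(test_sets_by_sentence_number, start=1):
#     print(k, per_word_entropy(test_set, model))
```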
The results for full sets, paragraph-starting subsets, and non-paragraph-starting subsets are presented in Figure 1.</Paragraph> <Paragraph position="1"> First, we can see that the entropy for the full sets (solid line) is generally increasing. This result corresponds to the previously discussed effect of entropy increasing with the sentence number. We also see that for all sentence numbers the paragraph-starting sentences have lower entropy than the non-paragraph-starting ones, which is what we intended to demonstrate. In this respect, the paragraph-starting sentences are similar to the first sentences, which makes intuitive sense.</Paragraph> <Paragraph position="2"> All the lines roughly show that entropy increases with the sentence number, but the behavior at the second and the third sentences is somewhat strange.</Paragraph> <Paragraph position="3"> We do not yet have a good explanation of this phenomenon, except to point out that paragraphs that start at the second or third sentence are probably not &quot;normal&quot; because they most likely do not indicate a topic shift. Another possible explanation is that this effect is an artifact of the corpus used.</Paragraph> <Paragraph position="4"> We have also tried to group sentences based on their sentence number within the paragraph, but were unable to observe a significant effect. This may be due to the decrease of this effect in the later sentences of large articles, or perhaps due to the relative weakness of the effect3.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Different Genres and Languages </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experiments on Fiction 3.1.1 Introduction </SectionTitle> <Paragraph position="0"> All the work on this problem so far has focused on Wall Street Journal articles. The results are thus naturally suspect; perhaps the observed effect is simply an artifact of the journalistic writing style.</Paragraph> <Paragraph position="1"> To address this criticism, we need to perform comparable experiments on another genre.</Paragraph> <Paragraph position="2"> The Wall Street Journal is a fairly prototypical example of a news article, or, more generally, of writing with a primarily informative purpose. One obvious counterpart of such a genre is fiction4. Another alternative might be to use transcripts of spoken dialogue. Unfortunately, works of fiction are either non-homogeneous (collections of works) or relatively short with relatively long subdivisions (chapters).</Paragraph> <Paragraph position="3"> This is crucial, since in the sentence number experiments we obtain one data point per article, and it is therefore impossible to use book chapters in place of articles. For our experiments we use War and Peace (Tolstoy, 1869), since it is rather large and publicly available. It contains only about 365 rather long chapters5. Unlike WSJ articles, each chapter is not written on a single topic, but usually has multiple topic shifts. These shifts, however, are marked only as paragraph breaks. We, therefore, have to assume that each paragraph break represents a topic shift, and treat each paragraph as the equivalent of a WSJ article, even though this is obviously suboptimal. 3We combine into one set very heterogeneous data: both the 1st and the 51st sentence might be in the same set, if they both start a paragraph. The experiment in Section 2.2 groups only the paragraph-starting sentences with the same sentence number.
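As an illustration of how the sentence sets discussed above (Sections 2.2 and 2.3) could be assembled, here is a small sketch. The input format, a list of articles whose sentences carry a paragraph-start flag, and the function name are assumptions made for this example, not part of the original experiments.

```python
from collections import defaultdict

def group_by_sentence_number(articles, max_n=10):
    """Group sentences by their sentence number (1-based) and split each
    group into paragraph-starting vs. other sentences.
    `articles` is assumed to be a list of articles, each a list of
    (tokenized_sentence, starts_paragraph) pairs -- an illustrative format."""
    sets = defaultdict(lambda: {"para_start": [], "other": []})
    for article in articles:
        if len(article) < max_n:   # keep only articles with >= max_n sentences
            continue
        for num, (sentence, starts_par) in enumerate(article[:max_n], start=1):
            key = "para_start" if starts_par else "other"
            sets[num][key].append(sentence)
    return sets

# sets[3]["para_start"] would then hold all third sentences that open a
# paragraph, mirroring the subsets compared in Figure 1.
```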
4We use prose rather than poetry, which presumably is even less informative, because poetry often has superficial constraints (meter); also, it is hard to find a large homogeneous poetry collection.</Paragraph> <Paragraph position="4"> The experimental setup is very similar to the one used in Section 2.2. We use roughly half of the data for training purposes and split the rest into testing sets, one for each sentence number, counted from the beginning of a paragraph.</Paragraph> <Paragraph position="5"> We then evaluate the results using the same method as in Section 2.2. We expect the entropy to increase with the sentence number, just as in the case of the sentences numbered from the article boundary. This effect is present, but is not very pronounced. To make sure that it is statistically significant, we also do 1000 control runs for comparison, with paragraph breaks inserted randomly at the appropriate rate. The results (including 3 random runs) can be seen in Figure 2. To verify that our results are significant, we compare the correlation coefficient between entropy and sentence number to the coefficients from the simulated runs, and find the observed correlation to be significant (P=0.016).</Paragraph> <Paragraph position="6"> It is fairly clear that the variation, especially between the first and the later sentences, is greater than would be expected for a purely random occurrence. We will see further evidence for this in the next section.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Experiments on Other Languages </SectionTitle> <Paragraph position="0"> To further verify that this effect is significant and universal, it is necessary to do similar experiments in other languages. Luckily, War and Peace is also digitally available in other languages, of which we pick Russian and Spanish for our experiments.</Paragraph> <Paragraph position="1"> We follow the same experimental procedure as in Section 3.1.2 and obtain the results for Russian (Figure 3(a)) and Spanish (Figure 3(b)). We see that the results are very similar to the ones we obtained for English. The results are again significant for both Russian (P=0.004) and Spanish (P=0.028).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Influence of Genre on the Strength of the Effect </SectionTitle> <Paragraph position="0"> We have established that entropy increases with the sentence number in works of fiction. We observe, however, that the effect is smaller than reported in our previous work (Genzel and Charniak, 2002) for Wall Street Journal articles. This is to be expected, since business and news writing tends to be more structured and informative in nature, gradually introducing the reader to the topic. Context, therefore, plays a greater role in this style of writing.</Paragraph> <Paragraph position="1"> To further investigate the influence of genre and style on the strength of the effect, we perform experiments on data from the British National Corpus (Leech, 1992), which is marked by genre.</Paragraph> <Paragraph position="2"> For each genre, we extract the first ten sentences of each genre subdivision of ten or more sentences.</Paragraph> <Paragraph position="3"> 90% of this data is used as training data and 10% as testing data. The testing data is separated into ten sets: all the first sentences, all the second sentences, and so on.
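The correlation-based evaluation used in Section 3.1 above, and again for the genre experiments that follow, could be implemented roughly as below: correlate per-set entropy with sentence number and compare against control runs with randomly placed paragraph breaks. The entropy values, the way control runs are generated, and the function names here are placeholders, not figures from the paper.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def empirical_p_value(observed_entropies, control_entropy_runs):
    """Fraction of control runs whose entropy/sentence-number correlation is
    at least as large as the observed one (a one-sided test, in the spirit of
    the comparison against the 1000 random-break runs)."""
    numbers = list(range(1, len(observed_entropies) + 1))
    observed_r = pearson(numbers, observed_entropies)
    at_least = sum(1 for run in control_entropy_runs
                   if pearson(numbers, run) >= observed_r)
    return observed_r, at_least / len(control_entropy_runs)

# Illustrative numbers only -- not values from the paper.
random.seed(0)
observed = [5.9, 6.1, 6.15, 6.2, 6.25, 6.3, 6.3, 6.35, 6.4, 6.4]
controls = [[6.2 + random.gauss(0, 0.1) for _ in range(10)] for _ in range(1000)]
r, p = empirical_p_value(observed, controls)
print(f"correlation={r:.3f}, empirical P={p:.3f}")
```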
We then use a trigram model trained on the training data to find the average per-word entropy for each set. We obtain ten numbers, which in general tend to increase with the sentence number. To find the degree to which they increase, we compute the correlation coefficient between the entropy estimates and the sentence numbers. We report these coefficients for some genres in Table 1. To ensure the reliability of the results, we performed the described process 400 times for each genre, sampling different testing sets.</Paragraph> <Paragraph position="4"> The results are very interesting and strongly support our assumption that informative and structured (and perhaps better-written) genres will have stronger correlations between entropy and sentence number. There is only one genre, tabloid newspapers6, that has a negative correlation. The four genres with the smallest correlation are all quite non-informative: tabloids, popular magazines, advertisements7 and poetry. Academic writing has higher correlation coefficients than non-academic writing. Also, humanities and social sciences writing is probably more structured and better stylistically than science and engineering writing. At the bottom of the table we have genres which tend to be produced by professional writers (biography), are very informative (TV news feed) or persuasive and rhetorical (parliamentary proceedings).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Conclusions </SectionTitle> <Paragraph position="0"> We have demonstrated that paragraph boundaries often cause the entropy to decrease, which seems to support the Entropy Rate Constancy principle. The effects are not very large, perhaps due to the fact that each new paragraph does not necessarily represent a shift of topic. 6Perhaps, in this case, the readers are only expected to look at the headlines.</Paragraph> <Paragraph position="1"> 7Advertisements could be called informative, but they tend to be sets of loosely related sentences describing various features, often in no particular order.</Paragraph> <Paragraph position="2"> This is especially true in a medium like the Wall Street Journal, where articles are very focused and tend to stay on one topic. In fiction, paragraphs are often used to mark a topic shift, but probably only a small proportion of paragraph breaks in fact represents topic shifts. We also observed that more informative and structured writing is subject to a stronger effect than speculative and imaginative writing, but the effect is present in almost all writing.</Paragraph> <Paragraph position="3"> In the next section we will discuss the potential causes of the entropy results presented both in the preceding work and in this one.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Investigating Non-Lexical Causes </SectionTitle> <Paragraph position="0"> In our previous work we discuss potential causes of the entropy increase. We find that both lexical (which words are used) and non-lexical (how the words are used) causes are present. In this section we will discuss possible non-lexical causes.</Paragraph> <Paragraph position="1"> We know that some non-lexical causes are present. The most natural way to find these causes is to examine the parse trees of the sentences.
Therefore, we collect a number of statistics on the parse trees and investigate whether any of them show a significant change with the sentence number.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We use the whole Penn Treebank corpus (Marcus et al., 1993) as our data set. This corpus contains about 50000 parsed sentences.</Paragraph> <Paragraph position="1"> Many of the statistics we wish to compute are very sensitive to the length of the sentence. For example, the depth of the tree is almost linearly related to the sentence length. This is important because the average length of the sentence varies with the sentence number. To make sure we exclude the effect of the sentence length, we need to normalize for it.</Paragraph> <Paragraph position="2"> We proceed in the following way. Let T be the set of trees, and let f : T -> R be some statistic of a tree.</Paragraph> <Paragraph position="3"> Let l(t) be the length of the underlying sentence for the tree t, and let f_avg(n) be the average value of the statistic f over all sentences of length n. We then define the sentence-length-adjusted statistic, for all t, as</Paragraph> <Paragraph position="4"> f_adj(t) = f(t) / f_avg(l(t)) </Paragraph> <Paragraph position="5"> The average value of the adjusted statistic is now equal to 1, and it is independent of the sentence length.</Paragraph> <Paragraph position="6"> We can now report the average value of each statistic for each sentence number, as we have done before, but instead we will group the sentence numbers into a small number of &quot;buckets&quot; of exponentially increasing length8. We do so to capture the behavior for all the sentence numbers, and not just for the first ten (as before), as well as to lump together sentences with similar sentence numbers, for which we do not expect much variation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Tree Depth </SectionTitle> <Paragraph position="0"> The first statistic we consider is also the most natural: tree depth. The results can be seen in Figure 4. In the first part of the graph we observe an increase in tree depth, which is consistent with the increasing complexity of the sentences. In the later sentences, the depth decreases slightly, but still stays above the depth of the first few sentences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Branching Factor and NP Size </SectionTitle> <Paragraph position="0"> Another statistic we investigate is the average branching factor, defined as the average number of children of all non-leaf nodes in the tree. It does not appear to be directly correlated with the sentence length, but we normalize it to make sure it is on the same scale, so that we can compare the strength of the resulting effect.</Paragraph> <Paragraph position="1"> Again, we expect lower entropy to correspond to flatter trees, which correspond to a larger branching factor. Therefore, we expect the branching factor to decrease with the sentence number, which is indeed what we observe (Figure 5, solid line).</Paragraph> <Paragraph position="2"> Each non-leaf node contributes to the average branching factor. It is likely, however, that the branching factor changes with the sentence number for certain types of nodes only. The most obvious contributors to this effect seem to be NP (noun phrase) nodes.
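The tree statistics discussed in Sections 4.1-4.3 (tree depth, overall and per-category branching factor, and the sentence-length adjustment) could be computed along these lines. The use of NLTK's small Penn Treebank sample and the helper names are assumptions made for this sketch; the paper uses the full treebank of roughly 50000 parsed sentences.

```python
from collections import defaultdict
from statistics import mean
# Loading the Penn Treebank sample via NLTK (corpus must be downloaded) is an
# assumption for this sketch, not what the paper does.
from nltk.corpus import treebank

def tree_depth(tree):
    """Depth of the parse tree (NLTK's height counts tree levels)."""
    return tree.height()

def branching_factor(tree, label=None):
    """Average number of children of non-leaf nodes; if `label` is given
    (e.g. 'NP'), restrict the average to nodes with that label."""
    sizes = [len(node) for node in tree.subtrees()
             if label is None or node.label() == label]
    return mean(sizes) if sizes else 0.0

def length_adjusted(trees, statistic):
    """Sentence-length-adjusted statistic: f_adj(t) = f(t) / f_avg(l(t)),
    where f_avg(n) is the average of f over sentences of length n."""
    by_length = defaultdict(list)
    for t in trees:
        by_length[len(t.leaves())].append(statistic(t))
    avg = {n: mean(vals) for n, vals in by_length.items()}
    return [statistic(t) / avg[len(t.leaves())] for t in trees]

trees = treebank.parsed_sents()[:200]      # small sample for illustration
adjusted_depths = length_adjusted(trees, tree_depth)
print(round(mean(adjusted_depths), 3))     # ~1.0 by construction
```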
Indeed, one is likely to use several words to refer to an object for the first time, but only a few words (even one, e.g., a pronoun) when referring to it later. We verify this intuitive suggestion by computing the branching factor for NP, VP (verb phrase) and PP (prepositional phrase) nodes. Only NP nodes show the effect, and it is much stronger (Figure 5, dashed line) than the effect for the branching factor as a whole. Furthermore, it is natural to expect that most of this effect arises from base NPs, which are defined as the NP nodes whose children are all leaf nodes.</Paragraph> <Paragraph position="3"> Indeed, base NPs show a slightly more pronounced effect, at least with regard to the first sentence (Figure 5, dotted line).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Further Investigations </SectionTitle> <Paragraph position="0"> We need to determine whether we have accounted for all of the branching factor effect by attributing it to the decrease in the size of the base NPs. To check, we compute the average branching factor, excluding base NP nodes.</Paragraph> <Paragraph position="1"> By comparing the solid line in Figure 6 (the original average branching factor result) with the dashed line (base NPs excluded), we can see that base NPs account for most, though not all, of the effect. It seems, then, that this problem requires further investigation.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Gapping </SectionTitle> <Paragraph position="0"> Another potential reason for the increase in sentence complexity might be the increase in the use of gapping. We investigate whether the number of ellipsis constructions varies with the sentence number. We again use the Penn Treebank for this experiment9.</Paragraph> <Paragraph position="1"> As we can see from Figure 7, there is indeed a significant increase in the use of ellipsis as the sentence number increases, which presumably makes the sentences more complex. Only about 1.5% of all the sentences, however, have gaps.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Future Work and Conclusions </SectionTitle> <Paragraph position="0"> We have discovered a number of interesting facts about the variation of sentences with the sentence number. It has been previously known that the complexity of the sentences increases with the sentence number. We have shown here that the complexity tends to decrease at paragraph breaks, in accordance with the Entropy Rate Constancy principle.</Paragraph> <Paragraph position="1"> We have verified that entropy also increases with the sentence number outside of the Wall Street Journal domain by testing it on a work of fiction. We have also verified that it holds for languages other than English. We have found that the strength of the effect depends on the informativeness of a genre.</Paragraph> <Paragraph position="2"> We also looked at the various statistics that show a significant change with the sentence number, such as the tree depth, the branching factor, the size of noun phrases, and the occurrence of gapping.</Paragraph> <Paragraph position="3"> Unfortunately, we have been unable to apply these results successfully to any practical problem so far, primarily because the effects are significant on average and not in any individual instances. Finding applications of these results is the most important direction for future research. 9Ellipsis nodes in the Penn Treebank are marked with *?* . See Bies et al.
(1995) for details.</Paragraph> <Paragraph position="4"> Also, since this paper essentially makes statements about human processing, it would be very appropriate to verify the Entropy Rate Constancy principle by doing reading-time experiments on human subjects.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Acknowledgments </SectionTitle> <Paragraph position="0"> We would like to acknowledge the members of the Brown Laboratory for Linguistic Information Processing, and particularly Mark Johnson, for many useful discussions. This research has been supported in part by NSF grants IIS 0085940, IIS 0112435, and</Paragraph> </Section> </Paper>