<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3241">
  <Title>The Entropy Rate Principle as a Predictor of Processing Effort: An Evaluation against Eye-tracking Data</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Predictions for Human Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Processing
</SectionTitle>
      <Paragraph position="0"> Let us examine the psycholinguistic predictions of G&amp;C's entropy rate principle in more detail. We need to distinguish two types of predictions: in-context predictions and out-of-context predictions.</Paragraph>
      <Paragraph position="1"> The principle states that the entropy rate in a text is constant, i.e., that speakers produce sentences so that on average, all sentences in a text have the same entropy. In other words, communication is optimal in the sense that all sentences in the text are equally easy to understand, as they all have the same entropy. null This constancy principle is claimed to hold for connected text: all sentences in a text should be equally easy to process if they are presented in context. If we take reading time as a measure of processing effort, then the principle predicts that there should be no significant correlation between sentence position and reading time in context. We will test this prediction in Experiment 2 using an eye-tracking corpus consisting of connected text.</Paragraph>
      <Paragraph position="2"> The entropy rate principle also makes the following prediction: if the entropy of a sentence is measured out of context (i.e., without taking the preceding sentences into account), then entropy will increase with sentence position. This prediction was tested extensively by G&amp;C, whose results will be replicated in Experiment 1. With respect to processing difficulty, the entropy rate principle also predicts that processing difficulty out of context (i.e., if isolated sentences are presented to experimental subjects) should increase with sentence position. We could not test this prediction, as we only had in-context reading time data available for the present study.</Paragraph>
      <Paragraph position="3"> However, there is another important prediction that can be derived from the entropy rate principle: sentences with a higher entropy should have higher reading times. This is an important precondition for the entropy rate principle, whose claims about the relationship between entropy and sentence position are only meaningful if entropy and processing effort are correlated. If there was no such correlation, then there would be no reason to assume that the out-of-context entropy of a sentence increases with sentence position. G&amp;C explicitly refer to this relationship i.e., they assume that a sentence that is more informative is harder to process (Genzel and Charniak, 2003, p. 65). Experiment 1 will try to demonstrate the validity of this important prerequisite of the entropy rate principle.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiment 1: Entropy Rate and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sentence Length
</SectionTitle>
      <Paragraph position="0"> The main aim of this experiment was to replicate G&amp;C's entropy rate effect. A second aim was to test the generality of their result by determining if the relationship between sentence position and entropy also holds for individual sentences (rather than for averages over sentences of a given position, as tested by G&amp;C). We also investigated the effect of two parameters that G&amp;C did not explore: the cut-off for article position (G&amp;C only deal with sentences up to position 25), and the size of the n-gram used for estimating sentence probability. Finally, we include sentence length as a baseline that entropy-based models should be evaluated against.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Method
3.1.1 Materials
</SectionTitle>
      <Paragraph position="0"> This experiment used the same corpus as Genzel and Charniak (2002), viz., the Wall Street Journal part of the Penn Treebank, divided into a training set (section 0-20) and a test set (sections 21-24).</Paragraph>
      <Paragraph position="1"> Each article was treated as a separate text, and sentence positions were computed by counting the sentences from the beginning of the text. The training set contained 42,075 sentences, the test set 7,133 sentences. The sentence positions in the test set varied between one and 149.</Paragraph>
      <Paragraph position="2">  The per-word entropy was computed using an n-gram language model, as proposed by G&amp;C:1</Paragraph>
      <Paragraph position="4"> Here, ^H(X) is the estimate of the per-word entropy of the sentence X, consisting of the words xi, and n is the size of the n-gram. The n-gram probabilities were computed using the CMU-Cambridge language modeling toolkit (Clarkson and Rosenfeld, 1997), with the following parameters: vocabulary size 50,000; smoothing by absolute discounting; sentence beginning and sentence end as context cues (default values were used for all other parameters). null G&amp;C use n = 3, i.e., a trigram model. We experimented with this parameter and used n = 1; : : :;5. For n = 1, equation (1) reduces to ^H(X) = 1jXj [?]xi2X logP(xi), i.e., a model based on word frequency.</Paragraph>
      <Paragraph position="5"> The experiment also includes a simple model that does not take any probabilistic information into account, but simply uses the sentence length jXj to predict sentence position. This model will serve as the baseline.</Paragraph>
      <Paragraph position="6"> 1Note that the original definition given by Genzel and Charniak (2002, 2003) does not include the minus sign. However, all their graphs display entropy as a positive quantity, hence we conclude that this is the definition they are using.</Paragraph>
      <Paragraph position="7">  tropy and sentence position (bins, 3-grams, cut-off 76) We also vary another parameter: c, the cut-off for the position. Genzel and Charniak (2002) use c = 25, i.e., only sentences with a position of 25 or lower are considered. In Genzel and Charniak (2003), an even smaller cut-off of c = 10 is used. This severely restricts the generality of the results obtained. We will therefore report results not only for c = 25, but also for c = 76. This cut-off has been set so that there are at least 10 items in the test set for each position. Furthermore, we also repeated the experiment without a cut-off for sentence length.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the results for the replication of Genzel and Charniak's (2002) entropy rate effect. The results at the top of the table were obtained using binning, i.e., we computed the mean entropy of all sentences of a given position, and then correlated these mean entropies with the sentence positions.</Paragraph>
      <Paragraph position="1"> The parameters n (n-gram size) and c (cut-off value) were varied as indicated in the previous section.</Paragraph>
      <Paragraph position="2"> The bottom of Table 1 gives the correlation coefficients computed on the raw data, i.e., without binning: here, we correlated the entropy of a given sentence directly with its position. The graphs in Figure 1 and Figure 2 illustrate the relationship between position and entropy and between position and length, respectively.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> 3.3.1 Entropy Rate and Sentence Length The results displayed in Table 1 confirm G&amp;C's main finding, i.e., that entropy increases with sentence length. For a cut-off of c = 25 (as used by G&amp;C), a maximum correlation of 0:6480 is obtained (for the 4-gram model). The correlations for the other n-gram models are lower. All correlations  length and sentence position (bins, cut-off 76) are significant (with the exception of the unigram model). However, we also find that a substantial correlation of 0:4607 is obtained even for the base-line model: there is a negative correlation between sentence length and sentence position, i.e., longer sentences tend to occur earlier in the text. This finding potentially undermines the entropy rate effect, as it raises the possibility that this effect is simply an effect of sentence length, rather than of sentence entropy. Note that the correlation coefficient for the none of the n-gram models is significantly higher than the baseline (significance was computed on the absolute values of the correlation coefficients).</Paragraph>
      <Paragraph position="1"> The second finding concerns the question whether the entropy rate effect generalizes to sentences with a position of greater than 25. The results in Table 1 show that the effect generalizes to a cut-off of c = 76 (recall that this value was chosen so that each position is represented at least ten times in the test data). Again, we find a significant correlation between entropy and sentence position for all values of n. This is illustrated in Figure 1.</Paragraph>
      <Paragraph position="2"> However, none of the n-gram models is able to beat the baseline of simple sentence position; in fact, now all models (with the exception of the unigram model) perform significantly worse than the baseline. The correlation obtained by the baseline model is graphed in Figure 2.</Paragraph>
      <Paragraph position="3"> Finally, we tried to generalize the entropy rate effect to sentences with arbitrary position (no cut-off). Here, we find that there is no significant positive correlation between entropy and position for any of the n-gram models. Only sentence length yields a reliable correlation, though it is smaller than if a cut-off is applied. This result is perhaps not surprising, as a lot of the data is very sparse: for positions between 77 and 149, less than ten data points are  (sentence length) available per position. Based on data this sparse, no reliable correlation coefficients can be expected.</Paragraph>
      <Paragraph position="4"> Let us now turn to Table 1, which displays the results that were obtained by computing correlation coefficients on the raw data, i.e., without computing the mean entropy for all sentences with the same position. We find that for all parameter settings a significant correlation between sentence entropy and sentence position is obtained (with the exception of n = 1, c = 25). The correlation coefficients are significantly lower than the ones obtained using binning, the highest coefficient is 0:0830. This means that a small but reliable entropy effect can be observed even on the raw data, i.e., for individual sentences rather than for bins of sentences with the same position.</Paragraph>
      <Paragraph position="5"> However, the results in Table 1 also confirm our findings regarding the baseline model (simple sentence length): in all cases the correlation coefficient achieved for the baseline is higher than the one achieved by the entropy models, in some cases even significantly so.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3.2 Disconfounding Entropy and Sentence
Length
</SectionTitle>
      <Paragraph position="0"> Taken together, the results in Table 1 seem to indicate that the entropy rate effect reported by G&amp;C is not really an effect of entropy, but just an effect of sentence length. The effect seems to be due to the fact that G&amp;C compute entropy rate by dividing the entropy of a sentence by its length: sentence length is correlated with sentence position, hence entropy rate will be correlated with position as well.</Paragraph>
      <Paragraph position="1"> It is therefore necessary to conduct additional analyses that remove the confound of sentence length. This can be achieved by computing partial correlations; the partial correlation coefficient between a factor 1 and a factor 2 expresses the degree of association between the factors that is left once the influence of a third factor has been removed from both factors. For example, we can compute the correlation of position and entropy, with sentence length partialled out. This will tell us use the amount of association between position and entropy that is left once the influence of length has been removed from both position and entropy.</Paragraph>
      <Paragraph position="2"> Table 2 shows the results of partial correlation analyses for length and entropy. Note that these results were obtained using total entropy, not per-word entropy, i.e., the normalizing term 1jXj was dropped from (1). The partial correlations are only reported for the trigram model.</Paragraph>
      <Paragraph position="3"> The results indicate that entropy is a significant predictor sentence position, even once sentence length has been partialled out. This result holds for both the binned data and the raw data, and for all cut-offs (with the exception of c = 76 for the binned data). Note however, that entropy is always a worse predictor than sentence length; the absolute value of the correlation coefficient is always lower. This indicates that the entropy rate effect is a much weaker effect than the results presented by G&amp;C suggest.</Paragraph>
      <Paragraph position="4">  the other factor partialled out cant correlation between sentence entropy and sentence position, even when sentence length, which was shown to be a confounding factor, was controlled for. The effect, however, was smaller than claimed by G&amp;C, in particular when applied to individual sentences, as opposed to means obtained for sentences at the same position.</Paragraph>
      <Paragraph position="5"> In the present experiment, we will test a crucial aspect of the entropy rate principle, viz., that entropy should correlate with processing effort. We will test this using a corpus of newspaper text that is annotated with eye-tracking data. Eye-tracking measures of reading time are generally thought to reflect the amount of cognitive effort that is required for the processing of a given word or sentence.</Paragraph>
      <Paragraph position="6"> A second prediction of the entropy rate principle is that sentences with higher position should be harder to process than sentences with lower position. This relationship should hold out of context, but not in context (see Section 2).</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Method
</SectionTitle>
      <Paragraph position="0"> As a test corpus, we used the Embra corpus (Mc-Donald and Shillcock, 2003). This corpus consists of 10 articles from Scottish and UK national broadsheet newspapers. The excerpts cover a wide range of topics; they are slightly edited to make them compatible with eye-tracking.2 The length of the articles varies between 97 and 405 words, the total size of the corpus is 2,262 words (125 sentences). Twenty-three native speakers of English read all 10 articles while their eye-movements were recorded using a Dual-Purkinke Image eye-tracker. To make sure that subjects read the texts carefully, comprehension questions were also administered. For details on method used to create the Embra corpus, see McDonald and Shillcock (2003).</Paragraph>
      <Paragraph position="1"> The training and development sets for this experiment were compiled so as to match the test corpus in terms of genre. This was achieved by selecting 2This includes, e.g., the removal of quotation marks and brackets, which can disrupt the eye-movement record.</Paragraph>
      <Paragraph position="2"> all files from the British National Corpus (Burnard, 1995) that originate from UK national or regional broadsheet newspapers. This subset of the BNC was divided into a 90% training set and a 10% development set. This resulted in a training set consisting of 6,729,104 words (30,284 sentences), and a development set consisting of 746,717 words (34,269 sentences). The development set will be used to test if the entropy rate effect holds on this new corpus.</Paragraph>
      <Paragraph position="3"> The sentence positions in the test set varied between one and 24, in the development, they varied between one and 206.</Paragraph>
      <Paragraph position="4">  To compute per-word entropy, we trained n-gram models on the training set using the CMU-Cambridge language modeling toolkit, with the same parameters as in Experiment 1. Again, n was varied from 1 to 5. We determined the correlation between per-word entropy and sentence position for both the development set (derived from the BNC) and for the test set (the Embra corpus).</Paragraph>
      <Paragraph position="5"> Then, we investigated the predictions of G&amp;C's entropy rate principle by correlating the position and entropy of a sentence with its reading time in the Embra corpus.</Paragraph>
      <Paragraph position="6"> The reading measure used was total reading time, i.e., the total time it takes a subject to read a sentence; this includes second fixations and re-fixations of words. We also experimented with other reading measures such as gaze duration, first fixation time, second fixation time, regression duration, and skipping probability. However, the results obtained with these measures were similar to the ones obtained with total reading time, and will not be reported here.</Paragraph>
      <Paragraph position="7"> Total reading time is trivially correlated with sentence length (longer sentences taker longer to read). Hence we normalized total reading time by sentence length, i.e., by multiplying with the factor 1jXj, also used in the computation of per-word entropy.</Paragraph>
      <Paragraph position="8"> It is also well-known that reading time is correlated with two other factors: word length and word frequency; shorter and more frequent words take  less time to read (Just and Carpenter, 1980). We removed these confounding factors by conducting multiple regression analyses involving word length, word frequency, and the predictor variable (entropy or sentence position). The aim was to establish if there is a significant effect of entropy or sentence length, even when the other factors are controlled for. Word frequency was estimated using the uni-gram model trained on the training corpus.</Paragraph>
      <Paragraph position="9"> In the eye-tracking literature, it is generally recommended to run regression analyses on the reading times collected from individual subjects. In other words, it is not good practice to compute regressions on average reading times, as this fails take betweensubject variation in reading behavior into account, and leads to inflated correlation coefficients. We therefore followed the recommendations of Lorch and Myers (1990) for computing regressions without averaging over subjects (see also McDonald and Shillcock (2003) for details on this procedure).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>