<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1040"> <Title>The Rhythm of Lexical Stress in Prose</Title> <Section position="3" start_page="302" end_page="303" type="metho"> <SectionTitle> 2 Stress entropy rate </SectionTitle>
<Paragraph position="0"> We regard every syllable as having either strong or weak stress, and we employ a purely lexical, context-independent mapping (a pronunciation dictionary) to tell us which syllables in a word receive which level of stress. We base our experiments on a binary-valued symbol set Σ1 = {W, S} and on a ternary-valued symbol set Σ2 = {W, S, P}, where 'W' indicates weak stress, 'S' indicates strong stress, and 'P' indicates a pause. [Footnote 1: We use the 116,000-entry CMU Pronouncing Dictionary version 0.4 for all experiments in this paper.]</Paragraph>
<Paragraph position="2"> Abstractly the dictionary maps words to sequences of symbols from {primary, secondary, unstressed}, which we interpret by downsampling to our binary system: primary stress is strong, non-stress is weak, and secondary stress ('2') we allow to be either weak or strong depending on the experiment we are conducting.</Paragraph>
<Paragraph position="3"> We represent a sentence as the concatenation of the stress sequences of its constituent words, with 'P' symbols (for the Σ2 experiments) breaking the stream where natural pauses occur.</Paragraph>
<Paragraph position="4"> Traditional approaches to lexical language modeling provide insight on our analogous problem, in which the input is a stream of syllables rather than words and the values are drawn from a vocabulary Σ of stress levels. We wish to create a model that yields approximate values for probabilities of the form p(s_k | s_0, s_1, ..., s_{k-1}), where s_i ∈ Σ is the stress symbol at syllable i in the text. A model with separate parameters for each history is prohibitively large, as the number of possible histories grows exponentially with the length of the input; and for the same reason it is impossible to train on limited data. Consequently we partition the history space into equivalence classes, and the stochastic n-gram approach that has served lexical language modeling so well treats two histories as equivalent if they end in the same n - 1 symbols.</Paragraph>
<Paragraph position="5"> As Figure 2 demonstrates, an n-gram model is simply a stationary Markov chain of order k = n - 1, or equivalently a first-order Markov chain whose states are labeled with tuples from Σ^k.</Paragraph>
<Paragraph position="6"> To gauge the regularity and compressibility of the training data we can calculate the entropy rate of the stochastic process as approximated by our model, an upper bound on the expected number of bits needed to encode each symbol in the best possible encoding. Techniques for computing the entropy rate of a stationary Markov chain are well known in information theory (Cover and Thomas, 1991). If {X_i} is a Markov chain with stationary distribution μ and transition matrix P, then its entropy rate is</Paragraph>
<Paragraph position="7"> H(X) = - Σ_{j,k} μ_j P_{j,k} log2 P_{j,k}.</Paragraph>
<Paragraph position="8"> The probabilities in P can be trained by accumulating, for each (s_1, s_2, ..., s_k) ∈ Σ^k, the k-gram count C(s_1, s_2, ..., s_k) in the training data, and normalizing by the (k - 1)-gram count C(s_1, s_2, ..., s_{k-1}).</Paragraph>
<Paragraph position="9"> The stationary distribution μ satisfies μP = μ, or equivalently μ_k = Σ_j μ_j P_{j,k} (Parzen, 1962). In general finding μ for a large state space requires an eigenvector computation, but in the special case of an n-gram model it can be shown that the value in μ corresponding to the state (s_1, s_2, ..., s_k) is simply the k-gram frequency C(s_1, s_2, ..., s_k)/N, where N is the number of symbols in the data. [Footnote 2: This ignores edge effects, since the counts C(s_1, s_2, ..., s_k) summed over all k-grams total N - k + 1, but the discrepancy is negligible when N is very large.] We therefore can compute the entropy rate of a stress sequence in time linear in both the amount of data and the size of the state space. This efficiency will enable us to experiment with values of n as large as seven; for larger values the amount of training data, not time, is the limiting factor.</Paragraph>
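To make the computation concrete, the following minimal Python sketch (ours, not the authors') estimates the stress entropy rate of an n-gram model directly from counts, using the prefix-frequency shortcut for μ described above. The input format (one W/S string per sentence) and the function name are illustrative assumptions.

    import math
    from collections import Counter

    def stress_entropy_rate(sentences, n):
        """Estimate the entropy rate (bits/symbol) of an order-(n-1) Markov
        chain trained on stress strings over {'W','S'} (or {'W','S','P'}).
        Counting is per sentence, so no cross-sentence n-grams are used."""
        ngrams = Counter()
        prefixes = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                ngrams[s[i:i + n]] += 1
                prefixes[s[i:i + n - 1]] += 1
        total = sum(prefixes.values())        # stationary prob. of a state
        h = 0.0                               # is its (n-1)-gram frequency
        for gram, c in ngrams.items():
            p_trans = c / prefixes[gram[:-1]]     # P(s_n | s_1 ... s_{n-1})
            p_state = prefixes[gram[:-1]] / total # mu of the history state
            h -= p_state * p_trans * math.log2(p_trans)
        return h

Because the alphabet has only two or three symbols, the state space stays small even for n = 6 or 7, which is what makes the linear-time computation above practical on tens of millions of syllables.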
</Section> <Section position="4" start_page="303" end_page="304" type="metho"> <SectionTitle> 3 Methodology </SectionTitle>
<Paragraph position="0"> The training procedure entails simply counting the number of occurrences of each n-gram in the training data and computing the stress entropy rate by the method described. As we treat each sentence as an independent event, no cross-sentence n-grams are kept: only those that fit between sentence boundaries are counted.</Paragraph>
<Section position="1" start_page="303" end_page="304" type="sub_section"> <SectionTitle> 3.1 The meaning of stress entropy rate </SectionTitle>
<Paragraph position="0"> We regard these experiments as computing the entropy rate of a Markov chain, estimated from training data, that approximately models the emission of symbols from a random source. The entropy rate bounds how compressible the training sequence is, not precisely how predictable unseen sequences from the same source would be. To measure the efficacy of these models in prediction it would be necessary to divide the corpus, train a model on one subset, and measure the entropy rate of the other with respect to the trained model. Compression can take place off-line, after the entire training set is read, while prediction cannot "cheat" in this manner.</Paragraph>
<Paragraph position="1"> But we claim that our results predict how effective prediction would be, for the small state space in our Markov model and the huge amount of training data translate to very good state coverage. In language modeling, unseen words and unseen n-grams are a serious problem, and are typically combatted with smoothing techniques such as the backoff model and the discounting formula offered by Good and Turing. In our case, unseen "words" never occur, for the tiniest of realistic training sets will cover the binary or ternary vocabulary. Coverage of the n-gram set is complete for our prose training texts for n as high as eight; nor do singleton states (counts that occur only once), which are the basis of Turing's estimate of the frequency of untrained states in new data, occur until n = 7.</Paragraph>
[Figure caption fragment: syllabified example "Lis-ten to me close-ly I'll en-deav-or to ex-plain" (Schwartz)]
</Section> <Section position="2" start_page="304" end_page="304" type="sub_section"> <SectionTitle> 3.2 Lexicalizing stress </SectionTitle>
<Paragraph position="0"> Lexical stress is the "backbone of speech rhythm" and the primary tool for its analysis (Baum, 1952). While the precise acoustical prominences of syllables within an utterance are subject to certain word-external hierarchical constraints observed by Halle (Halle and Vergnaud, 1987) and others, lexical stress is a local property. The stress patterns of individual words within a phrase or sentence are generally context independent.</Paragraph>
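As an illustration of this purely lexical mapping, here is a small Python sketch (ours, not the paper's) that builds a sentence's stress string from per-word stress-digit patterns ('1' primary, '2' secondary, '0' unstressed), using the first listed pronunciation and skipping sentences not fully covered by the dictionary, as described below and in Section 4. The toy lexicon entries are illustrative, not actual CMU dictionary values.

    def word_to_stress(pattern, secondary_is_strong=False):
        """Map a stress-digit pattern such as '102' to a W/S string."""
        strong = {'1', '2'} if secondary_is_strong else {'1'}
        return ''.join('S' if d in strong else 'W' for d in pattern)

    def sentence_to_stress(words, lexicon, secondary_is_strong=False):
        """Concatenate per-word stress patterns, first pronunciation only;
        return None for sentences the lexicon does not fully cover."""
        out = []
        for w in words:
            patterns = lexicon.get(w.lower())
            if not patterns:
                return None
            out.append(word_to_stress(patterns[0], secondary_is_strong))
        return ''.join(out)

    # Toy lexicon (digits are made up for illustration):
    lex = {'listen': ['10'], 'to': ['0'], 'me': ['1'], 'closely': ['10']}
    print(sentence_to_stress(['Listen', 'to', 'me', 'closely'], lex))  # SWWSSW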
<Paragraph position="1"> One source of error in our method is the ambiguity for words with multiple phonetic transcriptions that differ in stress assignment. Highly accurate techniques for part-of-speech labeling could be used for stress pattern disambiguation when the ambiguity is purely lexical, but often the choice, in both production and perception, is dialectal. It would be straightforward to divide among all alternatives the count for each n-gram that includes a word with multiple stress patterns, but in the absence of reliable frequency information to weight each pattern we chose simply to use the pronunciation listed first in the dictionary, which is judged by the lexicographer to be the most popular. Very little accuracy is lost in making this assumption. Of the 115,966 words in the dictionary, 4635 have more than one pronunciation; of these, 1269 have more than one distinct stress pattern; of these, 525 have different primary stress placements. This smallest class has a few common words (such as "refuse" used as a noun and as a verb), but most either occur infrequently in text (obscure proper nouns, for example), or have a primary pronunciation that is overwhelmingly more common than the rest.</Paragraph>
</Section> </Section> <Section position="5" start_page="304" end_page="306" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> The efficiency of the n-gram training procedure allowed us to exploit a wealth of data--over 60 million syllables--from 38 million words of Wall Street Journal text. We discarded sentences not completely covered by the pronunciation dictionary, leaving 36.1 million words and 60.7 million syllables for experimentation. Our first experiments used the binary Σ1 alphabet. The maximum entropy rate possible for this process is one bit per syllable, and given the unigram distribution of stress values in the data (55.2% are primary), an upper bound of slightly over 0.99 bits can be computed. Examining the 4-gram frequencies for the entire corpus (Figure 3a) sharpens this substantially, yielding an entropy rate estimate of 0.846 bits per syllable. Most frequent among the 4-grams are the patterns WSWS and SWSW, consistent with the principle of binary alternation mentioned in section 1.</Paragraph>
<Paragraph position="1"> The 4-gram estimate matches quite closely with the estimate of 0.852 bits that can be derived from the distribution of word stress patterns excerpted in Figure 3b. But both measures overestimate the entropy rate by ignoring longer-range dependencies that become evident when we use larger values of n.</Paragraph>
<Paragraph position="2"> For n = 6 we obtain a rate of 0.795 bits per syllable over the entire corpus.</Paragraph>
<Paragraph position="3"> Since we had several thousand times more data than is needed to make reliable estimates of stress entropy rate for values of n less than 7, it was practical to subdivide the corpus according to some criterion, and calculate the stress entropy rate for each subset as well as for the whole. We chose to divide at the sentence level and to partition the 1.59 million sentences in the data based on a likelihood measure suitable for testing the hypothesis from section 1.</Paragraph>
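The order-0 bound quoted above can be checked directly: with 55.2% of syllables carrying primary stress, the binary entropy comes out just over 0.99 bits. A two-line Python check (ours, for illustration):

    import math

    def binary_entropy(p):
        """Entropy in bits of a Bernoulli(p) source (the unigram bound)."""
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(binary_entropy(0.552))   # ~0.992 bits per syllable

The tighter figures (0.846 bits for 4-grams, 0.795 bits for n = 6) require the full conditional-entropy computation sketched in Section 2.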
<Paragraph position="4"> A lexical trigram backoff-smoothed language model was trained on separate data to estimate the language perplexity of each sentence in the corpus.</Paragraph>
<Paragraph position="5"> Sentence perplexity PP(S) is the inverse of the sentence probability normalized for length, PP(S) = (1/P(S))^(1/|S|), where P(S) is the probability of the sentence according to the language model and |S| is its word count. This measure gauges the average "surprise" after revealing each word in the sentence as judged by the trigram model. The question of whether more probable word sequences are also more rhythmic can be approximated by asking whether sentences with lower perplexity have lower stress entropy rate.</Paragraph>
<Paragraph position="6"> Each sentence in the corpus was assigned to one of one hundred bins according to its perplexity: sentences with perplexity between 0 and 10 were assigned to the first bin; between 10 and 20, the second; and so on. Sentences with perplexity greater than 1000, which numbered roughly 106 thousand out of 1.59 million, were discarded from all experiments, as 10-unit bins at that level captured too little data for statistical significance. A histogram showing the amount of training data (in syllables) per perplexity bin is given in Figure 4. [Figure 4 caption fragment: "... in each perplexity bin. The bin at perplexity level pp contains all sentences in the corpus with perplexity no less than pp and no greater than pp + 10. The smallest count (at bin 990) is 50662."]</Paragraph>
<Paragraph position="8"> It is crucial to detect and understand potential sources of bias in the methodology so far. It is clear that the perplexity bins are well trained, but not yet that they are comparable with each other. Figure 5 shows the average number of syllables per word in sentences that appear in each bin. That this function is roughly increasing agrees with our intuition that sequences with longer words are rarer. But it biases our perplexity bins at the extremes. Early bins, with sequences that have a small syllable rate per word (1.57 in the 0 bin, for example), are predisposed to a lower stress entropy rate since primary stresses, which occur roughly once per word, are more frequent. Later bins are also likely to be prejudiced in that direction, for the inverse reason: the increasing frequency of multisyllabic words makes it more and more likely that a primary stress is followed by a weak-stressed syllable, sharpening the probability distribution and decreasing entropy.</Paragraph>
<Paragraph position="9"> This is verified when we run the stress entropy rate computation for each bin. The results for n-gram models of orders 3 through 7, for the case in which secondary lexical stress is mapped to the "weak" level, are shown in Figure 6.</Paragraph>
<Paragraph position="10"> All of the rates calculated are substantially less than a bit, but this only reflects the stress regularity inherent in the vocabulary and in word selection, and says nothing about word arrangement. The atomic elements in the text stream, the words, contribute regularity independently. To determine how much is contributed by the way they are glued together, we need to remove the bias of word choice.</Paragraph>
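The following sketch (ours, not the paper's) restates the perplexity normalization and 10-unit binning defined a few paragraphs above; the log-probability input and the exact boundary handling are assumptions.

    def sentence_perplexity(log2_prob, word_count):
        """PP(S) = P(S)^(-1/|S|), given the trigram model's log2 P(S)."""
        return 2.0 ** (-log2_prob / word_count)

    def perplexity_bin(pp, width=10, max_pp=1000):
        """Bin index for a sentence; None for sentences above the cutoff,
        which are discarded (boundary treatment here is an assumption)."""
        if pp >= max_pp:
            return None
        return int(pp // width)   # bin 0 covers [0, 10), bin 1 covers [10, 20), ...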
<Paragraph position="11"> For this reason we settled on a model size, n = 6, and performed a variety of experiments with both the original corpus and with a control set that contained exactly the same bins with exactly the same sentences, but mixed up. Each sentence in the control set was permuted with a pseudorandom sequence of swaps based on an insensitive function of the original; that is to say, identical sentences in the corpus were shuffled the same way and sentences differing by only one word were shuffled similarly.</Paragraph>
<Paragraph position="12"> This allowed us to keep steady the effects of multiple copies of the same sentence in the same perplexity bin. More importantly, these tests hold everything constant--diction, syllable count, syllable rate per word--except for syntax, the arrangement of the chosen words within the sentence. Comparing the unrandomized results with this control experiment allows us, therefore, to factor out everything but word order. In particular, subtracting the stress entropy rates of the original sentences from the rates of the randomized sentences gives us a figure, relative entropy, that estimates how many bits we save by knowing the proper word order given the word choice. The results for these tests for weak and strong secondary stress are shown in Figures 7 and 8, including the difference curves between the randomized-word and original entropy rates.</Paragraph>
<Paragraph position="13"> The consistently positive difference function demonstrates that there is some extra stress regularity to be had with proper word order, about a hundredth of a bit on average. The difference is small indeed, but its consistency over hundreds of well-trained data points puts the observation on statistically solid ground.</Paragraph>
<Paragraph position="14"> The negative slopes of the difference curves suggest a more interesting conclusion: as sentence perplexity increases, the gap in stress entropy rate between syntactic sentences and randomly permuted sentences narrows. Restated inversely, using entropy rates for randomly permuted sentences as a baseline, sentences with higher sequence probability are relatively more rhythmical in the sense of our definition from section 1.</Paragraph>
<Paragraph position="15"> To supplement the Σ1 binary vocabulary tests we ran the same experiments with Σ2 = {W, S, P}, introducing a pause symbol to examine how stress behaves near phrase boundaries. Commas, dashes, semicolons, colons, ellipses, and all sentence-terminating punctuation in the text, which were removed in the Σ1 tests, were mapped to a single pause symbol for Σ2. Pauses in the text arise not only from semantic constraints but also from physiological limitations. These include the "breath groups" of syllables that influence both vocalized and written production (Ochsner, 1989). The results for these experiments are shown in Figures 9 and 10.</Paragraph>
<Paragraph position="16"> As expected, adding the pause symbol increases the confusion and hence the entropy, but the rates remain less than a bit. The maximum possible rate for a ternary sequence is log2 3 ≈ 1.58 bits.</Paragraph>
<Paragraph position="17"> The experiments in this section were repeated with a larger perplexity interval that partitioned the corpus into 20 bins, each covering 50 units of perplexity. The resulting curves mirrored the finer-grained curves presented here.</Paragraph>
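A sketch of the sentence-level control described above, under one concrete assumption about the deterministic shuffle: seed a pseudorandom permutation with a hash of the sentence, so identical sentences are always shuffled identically. (This simplification does not capture the paper's further property that sentences differing by a single word are shuffled similarly.)

    import hashlib
    import random

    def control_shuffle(words):
        """Permute a sentence's words with a shuffle that depends only on
        the sentence itself; the hash-based seeding is our assumption."""
        key = ' '.join(words).lower().encode()
        seed = int.from_bytes(hashlib.md5(key).digest()[:8], 'big')
        rng = random.Random(seed)
        shuffled = list(words)
        rng.shuffle(shuffled)
        return shuffled

    # The quantity of interest is then the per-bin difference
    # H(shuffled corpus) - H(original corpus), with H computed by the
    # stress_entropy_rate sketch from Section 2.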
</Section> </Paper>