The value of minimal prosodic information in spoken language 
corpora 
Caroline Lyon and Jill Hewitt 
Computer Science Department 
University of Hertfordshire, UK 
Abstract 
This paper reports on an investigation into rep- 
resenting tone unit boundaries (pauses) as well 
as words in a corpus of spoken English. An 
analysis of data from MARSEC (Machine Read- 
able Spoken English Corpus) shows that, for 
professional speakers, the inclusion of this min-
imal prosodic information will lower the per- 
plexity of a language model. The analysis is 
based on information theoretic techniques, and 
an objective method of evaluation is provided by 
entropy indicators, which are explained. This 
result is of general interest, and supports the 
development of improved language models for 
many applications. The automated capture of 
pauses seems to be technically feasible, and war- 
rants further investigation. The specific issue 
which prompted this investigation is a task in 
broadcasting technology: the semi-automated
production of online subtitles for live television 
programmes. This task is described, and an ap- 
proach to it using speech recognition technology 
is explained. 
1 Introduction 
There are a range of choices to be made when 
deciding on the contents of spoken language cor- 
pora. Useful information will be made avail- 
able by including prosodic annotations but this 
is difficult to automate. This paper looks at 
the advantages that can be expected to accrue 
from the inclusion of minimal prosodic informa- 
tion, major and minor tone unit boundaries, or 
pauses. Capturing this information automati- 
cally does not present the same difficulties as 
producing a full prosodic annotation: a method 
is described by (Huckvale and Fang, 1996). A 
purpose of this paper is to show that the devel- 
opment of such methods warrants further inves- 
tigation. 
The task which prompted this investigation 
will use trained speakers in a controlled envi- 
ronment, as described below. 
We investigate how much extra information 
is captured by representing major and minor 
pauses, as well as words, in a corpus of spoken 
English. We find that for speech such as broad- 
cast news or talks the inclusion of this minimal 
prosodic annotation will lower the perplexity of 
a language model. 
This result is of general interest, and supports 
the development of improved language models 
for many applications. However, the specific is- 
sue which we address is a task in broadcasting 
technology: the semi-automated production of 
online subtitles for live television programmes. 
Contemporaneous subtitling of television pro- 
grammes for the hearing impaired is set to in- 
crease significantly, and in the UK there is a 
mandatory requirement placed on TV broad-
casters to cover a certain proportion of live pro- 
grammes. This skilled task is currently done 
by highly trained stenographers (the subtitlers), 
but it could be semi-automated. First, the sub- 
titlers can transfer broadcast speech to writ- 
ten captions using automated speech recogni- 
tion systems instead of specialist keyboards. A
similar approach is being introduced for some 
Court reporting, where trained speakers in a 
controlled environment may replace traditional 
stenographers. Secondly, the display of the cap- 
tions can be automated, which is the problem 
we address here. It is necessary to place line 
breaks in appropriate places so the subtitler can 
be relieved of this secondary task. 
Based on previous work in this field (Lyon 
and Frank, 1997) a system will be developed to 
process the output of an ASR device. We need
to collect examples of this output as corpora of 
training data, and have investigated the type 
of information that it will be useful to obtain. 
Commercially available speech recognizers typi- 
cally output only the words spoken by the user, 
but as an intermediate stage in producing sub- 
titles we may want to use prosodic information 
that has been made explicitly available. 
Information theoretic techniques can be used 
to determine how useful it is to capture minimal 
prosodic information, major and minor pauses. 
The experiments reported here take a prosodi- 
cally marked corpus of spoken English, the Ma- 
chine Readable Spoken English Corpus (MAR- 
SEC). Different representations are compared, 
in which the language model either omits or in- 
cludes major and minor pauses as well as words. 
We need to assess how much the structure
of language is captured by the different meth- 
ods of representation, and an objective method 
of evaluation is provided by entropy indicators, 
described below. 
The relationship between prosody and syntax 
is well known (Arnfield, 1994; Fang and Huck- 
vale, 1996; Ostendorf and Vielleux, 1994; Taylor
and Black, 1998). Work in this field has typi-
cally focussed on the problem in speech syn- 
thesis of mapping a text onto natural sounding 
speech. Our work investigates the complemen- 
tary problem of mapping speech onto segmented 
text. 
It is not claimed that pauses are only pro- 
duced as structural markers: hesitation phe- 
nomena perform a number of roles, particularly 
in spontaneous speech. However, it has been 
shown that the placement of pauses provides
clues to syntactic structure. We introduce a sta- 
tistical measure that can help indicate whether 
it is worth going to the trouble of capturing cer- 
tain types of prosodic information for processing 
the output of trained speakers. 
Contents of paper 
This paper is organised in the following way. 
In Section 2 we describe the MARSEC corpus, 
which is the basis for the analysis. Section 3 
describes the entropy metric, and explains the 
theory behind its application. Section 4 de- 
scribes the experiments that were done, and 
gives the results. These indicate that it will be 
worthwhile to capture minimal prosodic infor- 
mation. In Section 5 we describe the subtitling 
task which prompted this investigation, and the 
paper concludes with Section 6 which puts our 
work into a wider context. 
2 MARSEC : Machine Readable 
Spoken English Corpus 
Since trained speakers will be producing the 
subtitles a corpus of speech from professional 
broadcasters is appropriate for this initial in- 
vestigation. The MARSEC corpus has been 
mainly collected from the BBC, and is available 
free on the web (see references). It is marked 
with prosodic annotations but not with POS 
tags. We have used part of the corpus, just
over 26,000 words, comprising the 4 categories 
of news commentary (A), news broadcasts (B), 
lectures aimed at a general audience (C) and 
lectures aimed at a restricted audience (D). 
The prosodic markup has been done manu- 
ally, by two annotators. Some passages have 
been done by both, and we see that there is a 
general consensus, but some differing decisions. 
Interpreting the speech data has an element of 
subjectivity. In Table 1 we show some sample
data as we used it, in which only the major and 
minor tone unit boundaries are retained. When
passages were marked up twice, we chose one 
in an arbitrary way, so that each annotator was 
chosen about equally. 
2.1 Comparison of automated and 
manual markup of pauses 
We suggest that the production of this type of 
data may be technically feasible for a trained 
speaker using an ASR device, and is worth in- 
vestigating further. 
(Huckvale and Fang, 1996) describe their 
method of automatically capturing pause infor- 
mation for the PROSICE corpus. The detection 
of major pauses is technically straightforward: 
they find regions of the signal that fall below 
a certain energy threshold (60 dB) for at least
250ms. Minor pauses are more difficult to find, 
since they can occur within words, and their 
detection is integrated into the signal / word 
alignment process. 
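For illustration, the sketch below (our own, not Huckvale and Fang's implementation) flags stretches of a signal whose frame energy stays below a threshold for at least 250 ms. The 10 ms frame size and the -60 dB threshold relative to the signal peak are assumptions made for the example.

```python
import numpy as np

def find_major_pauses(samples, sr, frame_ms=10, threshold_db=-60.0, min_ms=250):
    """Return (start, end) times in seconds of stretches whose frame energy
    stays below threshold_db (relative to the signal peak) for at least
    min_ms.  Frame size and threshold are illustrative assumptions."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[:n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    db = 20 * np.log10(rms / (np.abs(samples).max() + 1e-12))
    quiet = db < threshold_db
    pauses, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * frame_ms >= min_ms:
                pauses.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None and (len(quiet) - start) * frame_ms >= min_ms:
        pauses.append((start * frame_ms / 1000, len(quiet) * frame_ms / 1000))
    return pauses

# Example: one second of low-level noise with a 300 ms silent gap
sr = 16000
signal = np.random.randn(sr) * 0.1
signal[7000:7000 + int(0.3 * sr)] = 0.0
print(find_major_pauses(signal, sr))        # roughly [(0.44, 0.74)]
```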
We find that in the manually anno-
tated MARSEC corpus, the ratio of words to 
major pauses is approximately 17.8, to minor 
pauses 5.4, or 4.1 if both types of pause are taken
together. (Huckvale and Fang, 1996) quote fig- 
ures that work out at 7.7 for major pauses, 30.8 
for minor ones, or 6.15 taken together. This 
suggests that there is some discrepancy between 
what is considered a major or minor pause.
However, taking both together, the results from
the automated system are not out of line with the
manual one on this statistical measure.

Key: || is major pause   | is minor pause

annotator 1      annotator 2
we               we
heard            heard
automatic        automatic
fire             fire
|                |
a                a
few              few
yards            yards
away             away
|                ||
we               we
drove            drove
on               on
||               ||
a                a
jet              jet
appeared         appeared

Table 1: Example of MARSEC corpus with minimal prosodic annotations
3 Entropy indicators 
The entropy is a measure, in a certain sense, of 
the degree of unpredictability. If one representa- 
tion captures more of the structure of language 
than another, then the entropy measures should 
decline. 
If H represents the entropy of a sequence and 
P the perplexity, then 
$P = 2^H$
P can be seen as the branching factor, or num- 
ber of choices. 
3.1 Definition of entropy 
Let $\mathcal{A}$ be an alphabet, and $X$ be a discrete ran-
dom variable. The probability mass function is
then $p(x)$, such that

$p(x) = \mathrm{probability}(X = x), \quad x \in \mathcal{A}$

For an initial investigation into the entropy of
letter sequences the $x$'s would be the 26 letters
of the standard alphabet.
The entropy $H(X)$ is defined as

$H(X) = - \sum_{x \in \mathcal{A}} p(x) \cdot \log p(x)$
If logs to base 2 are used, the entropy mea- 
sures the minimum number of bits needed on
average to represent X: the wider the choice 
the more bits will be needed to describe it. 
We talk loosely of the entropy of a sequence,
but more precisely consider a sequence of sym- 
bols Xi which are outputs of a stochastic pro- 
cess. We estimate the entropy of the distribu- 
tion of which the observed outcome is typical. 
Further references are (Bell et al., 1990; Cover 
and Thomas, 1991), or, for an introduction, 
(Charniak, 1993, Chapter 2). 
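To make these definitions concrete, the following Python fragment (ours, not part of the original analysis) estimates $H(X)$ from observed relative frequencies and converts it to a perplexity via $P = 2^H$.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """H(X) = -sum p(x) log2 p(x), with p(x) estimated by relative frequency."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

letters = "cooking chocolate".replace(" ", "")
h = entropy(letters)
print(f"H = {h:.2f} bits, perplexity = {2 ** h:.2f}")
```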
3.1.1 Shannon's work 
Though we are investigating sequences of words, 
the subject is introduced by recalling Shannon's 
well known work on the entropy of letter se- 
quences (Shannon, 1951). He demonstrated 
that the entropy will decline if a representation 
is found that captures (i) the context and (ii) 
the structure of the sequence. 
Shannon produced a series of approximations 
to the entropy H of written English, which suc-
cessively take more of the statistics of the lan- 
guage into account. H0 represents the average 
number of bits required to determine a letter 
with no statistical information. Thus, for an 
alphabet of 16 symbols H0 = 4.0. 
H1 is calculated with information on single 
letter probabilities. If we knew, for example, 
that letter e had probability of 20% of occurring
while z had 1% we could code the alphabet with, 
on average, fewer bits than we could without 
this information. Thus H1 would be lower than 
H0. 
H2 uses information on the probability of 2 
letters occurring together; Hn, called the n- 
gram entropy, measures the amount of entropy 
with information extending over n adjacent let- 
ters of text,¹ and $H_n \le H_{n-1}$. As n increases
from 0 to 3, the n-gram entropy declines: the 
degree of predictability is increased as informa- 
tion from more adjacent letters is taken into ac- 
count. This fact is exploited in games where the 
contestants have to guess letters in words, such 
as the "Shannon game" or "Hangman" (Jelinek, 
1990). 
The formula for calculating the entropy of a 
sequence is given in (Lyon and Brown, 1997). 
An account of the process is also given in (Cover 
and Thomas, 1991, chapter 2) and (Shannon,
1951). 
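The exact formula used in our experiments is given in Lyon and Brown (1997); the sketch below uses one standard estimator of the n-gram entropy, the entropy of a symbol conditioned on the preceding n-1 symbols computed from n-gram frequency counts, and should be read as an illustration rather than as the paper's implementation.

```python
from collections import Counter
from math import log2

def ngram_entropy(symbols, n):
    """Estimate H_n: entropy of a symbol given the preceding n-1 symbols,
    from n-gram counts (for n = 1 this reduces to the plain entropy H1)."""
    windows = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    ngrams = Counter(windows)
    contexts = Counter(w[:-1] for w in windows)
    total = len(windows)
    return -sum((c / total) * log2(c / contexts[g[:-1]])
                for g, c in ngrams.items())

letters = list("the cat sat on the mat and the cat ran " * 50)
for n in (1, 2, 3):
    print(f"H{n} = {ngram_entropy(letters, n):.2f}")   # declines as n grows
```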
3.2 Entropy and structure 
The entropy can also be reduced if some of the 
structure of the letter strings is captured. As 
Shannon says "a word is a cohesive group of let-
ters with strong internal statistical influences" 
so the introduction of the space character to 
separate words should lower the entropy H2 and 
H3. With an extra symbol in the alphabet H0
will rise. There will be more potential pairs and 
triples, so H2 and H3 could rise. However, as 
the space symbol will prevent "irregular" letter 
sequences between words, and thus reduce the 
unpredictability, H2 and H3 do in fact decline.
For instance, for the words

COOKING CHOCOLATE

the trigrams "N-G-C" and "G-C-H" will be
replaced by "N-G-space", "G-space-C" and
"space-C-H".
¹ This notation is derived from that used by Shannon.
It differs from that used by (Bell et al., 1990). 
3.3 The entropy of ASCII data 
For other representations too, the insertion of 
boundary markers that capture the structure of 
a sequence will reduce the entropy. Gull and 
Skilling (1987) report on an experiment with a 
string of 32,768 zeroes and ones that are known 
to be ASCII data organised in patterns of 8 
as bytes, but with the byte boundary marker 
missing. By comparing the entropy of the se- 
quence with the marker in different positions 
the boundary of the data is "determined to a 
quite astronomical significance level". 
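The sketch below reproduces the flavour of that experiment on simulated data (it is not Gull and Skilling's procedure): ASCII text is written out as a bit string with an unknown alignment, and the candidate byte boundary that minimises the entropy of the resulting byte distribution is taken as the true one.

```python
from collections import Counter
from math import log2

def byte_entropy(bits, offset):
    """Entropy of the distribution of 8-bit chunks starting at offset."""
    chunks = [bits[i:i + 8] for i in range(offset, len(bits) - 7, 8)]
    counts = Counter(chunks)
    total = len(chunks)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Simulated data: ASCII text as a bit string, with 3 junk bits prepended
# so that the true byte boundary lies at offset 3.
text = "entropy indicators reveal hidden structure " * 100
bits = "101" + "".join(format(ord(ch), "08b") for ch in text)

best = min(range(8), key=lambda off: byte_entropy(bits, off))
print("estimated boundary offset:", best)     # expected: 3
```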
3.4 The entropy of word sequences 
This method of analysis can also be applied to 
strings of words. The entropy indicator will 
show if a sequence of words can be decom- 
posed into segments, so that some of the struc- 
ture is captured. Our current work investigates 
whether pauses in spoken English perform this 
role. 
Previously we showed how the entropy of 
text mapped onto part-of-speech tags could be 
reduced if clauses and phrases were explicitly 
marked (Lyon and Brown, 1997). Syntactic 
markers can be considered analogous to spaces 
between words in letter sequence analysis. They 
are virtual punctuation marks. 
Consider, for example, how subordinate 
clauses are discerned. There may be an explicit 
opening marker, such as a 'wh' word, but often 
there is no mark to show the end of the clause. 
If markers are inserted and treated as virtual 
punctuation some of the structure is captured
and the entropy declines. A sentence without 
opening or closing clause boundary markers,
like 
The shirt he wants is in the wash. 
can be represented as 
The shirt { he wants } is in the 
wash. 
This sentence can be given part-of-speech 
tags, with two of the classes in the tagset rep- 
resenting the symbols '{' (virtual-tag1)
and '}' (virtual-tag2). The ordinary part-of- 
speech tags have probabilistic relationships with 
the virtual tags in the same way that they do 
with each other. The pairs and triples generated 
by the second string exclude (noun, pronoun), 
(noun, pronoun, verb) but include, for instance, 
(noun, virtual-tag1), (noun, virtual-tag1, pro- 
noun).
Using this representation, the entropy, H2 
and H3, with virtual tags explicitly marking
some constituents is lower than that without the 
virtual tags. In a similar way the words from 
a speech signal can be segmented into groups, 
with periodic pauses. 
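As a small illustration (with simplified, invented tag names), the fragment below lists the tag trigrams for the example sentence with and without the virtual clause markers; the version with '{' and '}' no longer contains the (noun, pronoun, verb) triple.

```python
def trigrams(seq):
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

# "The shirt he wants is in the wash", with invented coarse tags
plain   = ["det", "noun", "pron", "verb", "verb", "prep", "det", "noun"]
virtual = ["det", "noun", "{", "pron", "verb", "}", "verb", "prep", "det", "noun"]

print(("noun", "pron", "verb") in trigrams(plain))     # True
print(("noun", "pron", "verb") in trigrams(virtual))   # False
print(("noun", "{", "pron") in trigrams(virtual))      # True
```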
4 Results from analysis of the 
MARSEC corpus 
We can measure the entropy H0, H1, H2 and
H3 for the corpus with and without prosodic 
markers for major and minor pauses. However, 
rather than use words themselves we map them
onto part-of-speech tags. This reduces an in- 
definite number of words to a limited number 
of tags, and makes the investigation computa- 
tionally feasible. We expect 
• H0 will be higher with a marker, since the 
alphabet size increases 
• H1, which takes into account the single 
element probabilities, will increase or de- 
crease depending on the frequency of the 
new symbol. 
• H2 and H3 should fall if the symbols repre- 
senting prosodic markers capture some of 
the language structure. We expect H3 to
show this more than H2. 
• If instead of the real pause markers mock 
ones are inserted in an arbitrary fashion,
we expect H to rise in all cases. 
To conduct this investigation the MARSEC 
corpus was taken from the web and pre-processed
to leave the words plus major and minor tone 
unit boundaries, or pauses. Then it was auto- 
matically tagged, using a version of the Claws 
tagger 2. These tags were mapped onto a smaller 
tagset with 26 classes, 28 including the major 
and minor pauses. The tagset is given in the 
appendix. Random inspection indicated about 
96% words correctly tagged. 
Then the entropy of part of the corpus was 
calculated (i) for words only (ii) with minor 
pauses represented (iii) with major pauses rep- 
resented and (iv) with major and minor pauses
represented. Results are shown in Table 2, and 
in Figure 1. 
2 Claws4, supplied by the University of Lancaster, de-
scribed by Garside (1987) 
H3 is calculated in two different ways. First, 
the sequence of tags is taken as an uninterrupted 
string (column H3 (1) in Table 2). Secondly, 
we take the major pauses as equivalent to sen- 
tence ends, points of segmentation, and omit 
an)" triple that spans 2 sentences (column H3 
(2)). In practice, this will be a sensible ap- 
proach. 
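A sketch of this calculation is given below. The pause symbols and the toy tag stream are invented, and the formula is one standard trigram-entropy estimator; the H3 (2) condition is read here as counting trigrams only within stretches bounded by major pauses.

```python
from collections import Counter
from math import log2
import random

MAJOR, MINOR = "<MAJ>", "<MIN>"          # hypothetical pause symbols

def trigram_entropy(tags, split_at_major=False):
    """H3 = -sum p(t1,t2,t3) log2 p(t3 | t1,t2); with split_at_major the
    trigrams spanning a major pause are omitted (cf. column H3 (2))."""
    if split_at_major:
        segments, current = [], []
        for t in tags:
            current.append(t)
            if t == MAJOR:
                segments.append(current)
                current = []
        if current:
            segments.append(current)
    else:
        segments = [tags]
    tri, bi = Counter(), Counter()
    for seg in segments:
        for i in range(len(seg) - 2):
            tri[tuple(seg[i:i + 3])] += 1
            bi[tuple(seg[i:i + 2])] += 1
    total = sum(tri.values())
    return -sum((c / total) * log2(c / bi[g[:2]]) for g, c in tri.items())

# Toy tag stream with pause symbols interleaved, for demonstration only
random.seed(0)
classes = ["det", "noun", "verb", "prep", "adj", "conj"]
tags = []
for _ in range(500):
    tags += random.choices(classes, k=random.randint(3, 6))
    tags.append(random.choice([MAJOR, MINOR]))
print(f"{trigram_entropy(tags):.2f}  {trigram_entropy(tags, split_at_major=True):.2f}")
```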
This experiment shows how the entropy H3 
declines when information on pauses is ex- 
plicitly represented. Though there is not a 
transparent mapping from prosody to structure, 
there is a relationship between them which can 
be exploited. These experiments indicate that 
English language used by professional speakers 
can be coded more efficiently when pauses are 
represented. 
4.1 Comparison with arbitrarily placed 
pauses 
Compare these results to those of another ex- 
periment where the corpora of words only were 
taken and pauses inserted in an arbitrary man- 
ner. Major pauses were inserted every 19 words, 
minor pauses every 7 words, except where there 
is a clash with a major pause. The numbers 
of major and minor pauses are comparable to 
those in the real data. Results are shown in 
Table 3. H2 and H3 are higher than the compa- 
rable entropy levels for speech with pauses in- 
serted as they were actually spoken. Moreover, 
the entropy levels are higher than for speech 
without any pauses: the arbitrary insertion has 
disrupted the underlying structure, and raised 
the unpredictability of the sequence. 
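A minimal sketch of how such a control condition can be generated (the pause symbol names are our own) is:

```python
def insert_mock_pauses(words, major_every=19, minor_every=7,
                       major="<MAJ>", minor="<MIN>"):
    """Insert a major pause after every 19th word and a minor pause after
    every 7th word, except where the minor pause clashes with a major one."""
    out = []
    for i, w in enumerate(words, start=1):
        out.append(w)
        if i % major_every == 0:
            out.append(major)
        elif i % minor_every == 0:
            out.append(minor)
    return out
```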
4.2 Entropy and corpus size
Note that we are interested in comparative en- 
tropies. The entropy converges slowly to its 
asymptotic value as the size of the corpora in- 
creases, and this is an upper bound on en- 
tropy values for smaller corpora. Ignoring this 
may give misleading results (Farach et al.,
1995). The reason why entropy is underesti- 
mated for small corpora comes from the fact 
that we approximate probabilities by frequency 
counts, and for small corpora these may be poor 
approximations. The count of pairs and triples 
is the basis for the probability estimates, and 
with small corpora many of the triples in par- 
ticular that will appear later have not yet occurred.
Thus the entropy and perplexity measures un- 
derestimate their true values.

Figure 1: Comparison of trigram part-of-speech entropy for sections of the MARSEC corpus,
(i) with both major and minor pauses marked (ii) without either. The tagset size is 28 with the
pauses represented, 26 without them. The entropy is calculated with trigrams spanning a major
pause omitted, as in Table 2 column H3 (2).
5 The subtitling task 
We now show how the investigation described 
here is relevant to the subtitling task. Trained 
subtitlers are employed to output real time cap- 
tions for some TV programmes, currently as a 
stream of typewritten text. In future this may
be done with an ASR system. In either case, 
the production of the caption is followed by the 
task of displaying the text so that line breaks 
occur in natural positions. The type of pro- 
grammes for which this is needed include con- 
temporaneous commentary on news events,
sports, studio discussions, and chat shows. 
Caption format is limited by line length, 
and there are usually at most 2 lines per cap- 
tion. Some examples of subtitles that have been 
displayed are taken from the broadcast com- 
mentary on Princess Diana's funeral, with line
breaks as shown: 
As I said the great tenor bell is 
half muffled with 
a piece of leather around its 
clapper 
They now bear the coffin of the 
Princess 
of Wales into Westminster Abbey. 
An example from a chat show is:
Who told you that you resemble Mr 
Hague? 
Speech                   Number of      Number of
representation           minor pauses   major pauses   H0     H1     H2     H3 (1)   H3 (2)

Words only                    0              0         4.70   4.11   3.29   2.94     2.94
Words + minor pauses       3454              0         4.75   4.09   3.18   2.84     2.84
Words + major pauses          0           1029         4.75   4.19   3.32   2.94     2.84
Words + both pauses        3454           1029         4.81   4.17   3.16   2.82     2.70
Table 2: Entropy measures for 18,655 words of the MARSEC corpus (sections A, B, C concatenated)
with and without major and minor pauses. H3 (2) measures entropy without triples spanning a 
major pause (see text). 
Speech                   Number of      Number of
representation           minor pauses   major pauses   H0     H1     H2     H3 (1)

Words + pauses in           3109           1209        4.81   4.19   3.63   3.05
arbitrary positions
Table 3: Entropy measures for the same part of the MARSEC corpus with pauses in arbitrary 
positions: a major pause every 19 words, a minor pause every 7 words (except for clashes with
major) 
I work at a golf club and we have 
lots 
of societies and groups come in. 
The quality of the subtitles can be improved 
by placing the line breaks and caption breaks in 
places where the text would be naturally seg- 
mented. Though this is partially a subjective 
process, a style book can be produced that gives 
agreed guidelines. 
Some of the poor line breaks can be readily 
corrected, but the production of a high quality 
display overall is not a trivial task. The pauses 
in speech do not map straight onto suitable line 
breaks, but they are a significant source of infor- 
mation. In this work we have been considering 
the output of trained speakers, or the recording 
of rehearsed speech. This differs from ordinary, 
spontaneous speech, where hesitation phenom- 
ena may have a number of causes. However, 
in the type of speech we are processing we have 
shown that the use of pauses captures some syn- 
tactic structure. An example given by (Osten- 
dorf and Vielleux, 1994) is 
Mary was amazed Ann Dewey was 
angry. 
which was produced as 
Mary was amazed || Ann Dewey was
angry. 
To illustrate a problem of text segmentation 
consider how conjunctions should be handled. 
Now conjunctions join like with like: verb with 
verb, noun with noun, clause with clause, and
so on. If a conjunction joins two single words, 
such as "black and blue" we do not want it to
trigger a line break. However, it may be a rea- 
sonable break point if it joins two longer com- 
ponents. Consider the following example from 
the MARSEC corpus: 
it held its trajectory for one
minute | flashes burst | from
its wings | and rockets
exploded | safely behind us ||
The word "and" without the pause marked 
is part of a trigram "noun conjunction noun"
which would typically stick together. In fact, 
it actually joins two sentences, and would be a 
good point for a break. By including the pause 
marker we can identify this. 
The proposed system for finding line breaks 
will integrate rule based and data driven com- 
ponents. This approach is derived from earlier 
work in a related field in which a partial parser 
has been developed (Lyon and Frank, 1997). It 
will be based on a part-of-speech trigram model 
combined with lexical information. We will be 
able to develop a better language model if we ex- 
plicitly include a representation for major and 
minor pauses. 
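By way of illustration only (the proposed system combines rule-based and trigram components), the greedy sketch below fills subtitle lines up to a character limit and, when a break is forced, prefers the most recently seen pause marker as the break point; the 37-character limit and the marker symbols are assumptions.

```python
MAJOR, MINOR = "||", "|"

def layout_captions(tokens, max_chars=37):
    """Greedy line filler: break at the most recent pause marker when a
    line would overflow max_chars; pause markers are not displayed."""
    lines, current, length, last_pause = [], [], 0, None
    for tok in tokens:
        if tok in (MAJOR, MINOR):
            last_pause = len(current)          # preferred break point
            continue
        if length + len(tok) + (1 if current else 0) > max_chars:
            if last_pause:                     # break at the remembered pause
                lines.append(" ".join(current[:last_pause]))
                current = current[last_pause:] + [tok]
            else:                              # no pause seen: break here
                lines.append(" ".join(current))
                current = [tok]
            last_pause = None
            length = sum(len(w) for w in current) + len(current) - 1
        else:
            length += len(tok) + (1 if current else 0)
            current.append(tok)
    if current:
        lines.append(" ".join(current))
    return lines

words = ("it held its trajectory for one minute | flashes burst | from "
         "its wings | and rockets exploded | safely behind us ||").split()
for line in layout_captions(words):
    print(line)
```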
6 Role of pauses in speech in a 
wider context 
As hypothesized, the entropy H3 declines as
the major and minor boundary markers are in- 
serted. This indicates that it will be worthwhile 
to capture the prosodic information on major 
and minor pauses from the ASR, in addition to 
the usual transcription of the words themselves. 
Our investigation was prompted by a spe- 
cific task in which the output of trained speak- 
ers is transcribed automatically. However, it 
is of wider interest. We show that represent-
ing pauses as well as words helps determine the 
structure of language, and thus contributes to
the quality of a language model. 
It is many years since Mandelbrot investi- 
gated the way in which the statistical structure 
of language is best adapted to coding words 
(Mandelbrot, 1952). He suggested that lan- 
guage is "intentionally if not consciously pro- 
duced in order to be decoded word by word in 
the easiest possible fashion." If we accept his 
suggestion we would expect that naturally oc- 
curring events, such as pauses in speech, are 
utilised to facilitate the transfer of information. 
Patterns of speech segmentation are likely to 
emerge to produce an efficient coding (Lyon,
1998). 

References 
S Arnfield. 1994. Prosody and Syntax in Cor-
pus Based Analysis of Spoken English. Ph.D. 
thesis, University of Leeds. 
T C Bell, J G Cleary, and I H Witten. 1990.
Text Compression. Prentice Hall. 
E Charniak. 1993. Statistical Language Learn-
ing. MIT Press. 
T M Cover and J A Thomas. 1991. Elements
of Information Theory. John Wiley and Sons 
Inc. 
Alex Chengyu Fang and Mark Huckvale. 1996. 
Synchronising syntax with speech signals. In 
V.Hazan, M.Holland, and S.Rosen, editors, 
Speech, Hearing and Language. University 
College London. 
M Farach and M Noordewier et al. 1995. On 
the entropy of DNA. In Symposium on Dis-
crete Algorithms. 
R Garside. 1987. The CLAWS word-tagging 
system. In R Garside, G Leech, and G Samp- 
son, editors, The Computational Analysis of 
English: a corpus based approach, pages 30- 
41. Longman. 
S Gull and J Skilling. 1987. Recent develop- 
ments at Cambridge. In C Ray Smith and
Gary Erickson, editors, Maximum Entropy
and Bayesian Spectral Analysis and Estima- 
tion Problems. 
M Huckvale and A C Fang. 1996. PROSICE: 
A spoken English database for prosody re- 
search. In S Greenbaum, editor, Comparing 
English Worldwide: The International Cor- 
pus of English. OUP.
F Jelinek. 1990. Self-organized language mod- 
eling for speech recognition. In A Waibel and 
K F Lee, editors, Readings in Speech Recog- 
nition, pages 450-503. Morgan Kaufmann. 
IBM T. J. Watson Research Centre.
C Lyon and S Brown. 1997. Evaluating Parsing 
Schemes with Entropy Indicators. In MOL5, 
5th Meeting on the Mathematics of Language. 
C Lyon and R Frank. 1997. Using Single Layer 
Networks for Discrete, Sequential Data: an 
Example from Natural Language Processing. 
Neural Computing Applications, 5 (4). 
C Lyon. 1998. Language evolution: survival 
of the fittest in the statistical environment.
Technical report, Computer Science Depart- 
ment, University of Hertfordshire, June. 
B Mandelbrot. 1952. An informational theory 
of the statistical structure of language. In 
Symposium on Applications of Communica-
tion Theory. Butterworth. 
M Ostendorf and N Vielleux. 1994. A hierar-
chical stochastic model for automatic predic- 
tion of prosodic boundary location. Compu- 
tational Linguistics, 20(1). 
C E Shannon. 1951. Prediction and Entropy of 
Printed English. Bell System Technical Jour- 
nal, pages 50-64. 
P Taylor and A Black. 1998. Assigning phrase 
breaks from part-of-speech sequences. 
