Sentence-Internal Prosody Does not Help Parsing the Way Punctuation Does
Michelle L Gregory
Brown University
mgregory@cog.brown.edu
Mark Johnson
Brown University
Mark Johnson@Brown.edu
Eugene Charniak
Brown University
ec@cs.brown.edu
Abstract
This paper investigates the usefulness of
sentence-internal prosodic cues in syntac-
tic parsing of transcribed speech. Intu-
itively, prosodic cues would seem to pro-
vide much the same information in speech
as punctuation does in text, so we tried to
incorporate them into our parser in much
the same way as punctuation is. We com-
pared the accuracy of a statistical parser
on the LDC Switchboard treebank corpus
of transcribed sentence-segmented speech
using various combinations of punctua-
tion and sentence-internal prosodic infor-
mation (duration, pausing, and f0 cues).
With no prosodic or punctuation informa-
tion the parser’s accuracy (as measured by
F-score) is 86.9%, and adding punctuation
increases its F-score to 88.2%. However,
all of the ways we have tried of adding
prosodic information decrease the parser’s
F-score to between 84.8% to 86.8%, de-
pending on exactly which prosodic infor-
mation is added. This suggests that for
sentence-internal prosodic information to
improve speech transcript parsing, either
different prosodic cues will have to used
or they will have be exploited in the parser
in a way different to that used currently.
1 Introduction
Acoustic cues, generally duration, pausing, and
f0, have been demonstrated to be useful for auto-
S
INTJ
UH
Oh
,
,
NP
PRP
I
VP
VBD
loved
NP
PRP
it
.
.
Figure 1: A treebank style tree in which punctuation
is coded with terminal and preterminal nodes.
matic segmentation of natural speech (Baron et al.,
2002; Hirschberg and Nakatani, 1998; Neiman et
al., 1998). In fact, it is generally accepted that
prosodic information is a reliable tool in predict-
ing topic shifts and sentence boundaries (Shriberg
et al., 2000). Sentences are generally demarcated
by a major fall (or rise) in f0, lengthening of the
 nal syllable, and following pauses. However,
the usefulness of prosodic information in sentence-
internal parsing is less clear. While assumed not
to be a one-to-one mapping, there is evidence
that there is a strong correlation between prosodic
boundaries and sentence-internal syntactic bound-
aries (Altenberg, 1987; Croft, 1995). For exam-
ple, Schepman and Rodway (2000) have shown that
prosodic cues reliably predict ambiguous attach-
ment of relative clauses within coordination con-
structions. Jansen et al. (2001) have demonstrated
that prosodic breaks and an increase in pitch range
can distinguish direct quotes from indirect quotes in
a corpus of natural speech.
This paper evaluates the accuracy of a statistical
parser whose input includes prosodic cues. The pur-
pose of this study to determine if prosodic cues im-
prove parsing accuracy in the same way that punc-
tuation does. Punctuation is represented in the vari-
ous Penn treebank corpora as independent word-like
tokens, with corresponding terminal and pretermi-
nal nodes, as shown in Figure 1 (Bies et al., 1995).
Even though this seems linguistically highly un-
natural (e.g., punctuation might indicate supraseg-
mental prosodic properties), statistical parsers gen-
erally perform signi cantly better when their train-
ing and test data contains punctuation represented
in this way than if the punctuation is stripped out
of the training and test data (Charniak, 2000; En-
gel et al., 2002; Johnson, 1998). On the Switch-
board treebank data set using the experimental setup
described below we obtained an F-score of 0.882
when using punctuation and 0.869 when punctua-
tion was stripped out, replicating previous experi-
ments demonstrating the importance of punctuation.
(F-score is a standard measure of parse accuracy, see
e.g., Manning and Schcurrency1utze (1999) for details).
This paper investigates how prosodic cues, when
encoded in the parser’s input in a manner similar to
the way the Penn treebanks encode punctuation, af-
fect parser accuracy. Our starting point is the ob-
servation that the Penn treebank annotation of punc-
tuation does signi cantly improve parsing accuracy.
Coupled with the assumption that punctuation and
prosody are encoding similar information, this led
us to try to encode prosodic information in a man-
ner that was as similar as possible to the way that
punctuation is encoded in the Penn treebanks.
For example, commas in text and pauses in speech
seem to convey similar information. In fact, when
transcribing speech, commas are often used to de-
note a pause. Thus, given the correlation between
the two, and the fact that sentence-internal punctu-
ation tends to be commas, we expected that pause
duration, coded in a way similar to punctuation,
would improve parsing accuracy in the same way
that punctuation does.
While it may be the case that the encoding of
prosodic information used in the experiments be-
low is perhaps not optimal and the parser has not
been tuned to use this information, note that exactly
the same objections could be made to the way that
punctuation is encoded and used in modern statis-
tical parsers, and punctuation does in fact dramati-
cally improve parsing accuracy.
We focus in this paper on parsing accuracy in a
modern statistical parsing framework, but it is im-
portant to remember that prosodic cues might help
parsing in other ways as well, even if they do not im-
prove parsing accuracy. Ncurrency1oth et al. (2000) point out
that prosodic cues reduce parsing time and increase
recognition accuracy when parsing speech lattices
with the hand-crafted Verbmobil grammar. Page 266
of Kompe (1997) discusses the effect that incorpo-
rating prosodic information has on parse quality in
the Verbmobil system using the TUG uni cation
grammar parser: out of the 54 parses affected by
the addition of prosodic information, 33 were judged
 better with prosody , 14 were judged  better with-
out prosody and 7 were judged  unclear . Our
experiments below differ from the experiments of
Ncurrency1oth and Kompe in many ways. First, we used
speech transcripts rather than speech recognizer lat-
tices. Second, we used a general-purpose broad-
coverage statistical parser rather than a uni cation
grammar parser with a hand-constructed grammar.
2 Method
The data used for this study is the transcribed ver-
sion of the Switchboard Corpus as released by
the Linguistic Data Consortium. The Switchboard
Corpus is a corpus of telephone conversations be-
tween adult speakers of varying dialects. The cor-
pus was split into training and test data as de-
scribed in Charniak and Johnson (2001). The train-
ing data consisted of all  les in sections 2 and 3 of
the Switchboard treebank. The testing corpus con-
sists of  les sw4004.mrg to sw4153.mrg, while  les
sw4519.mrg to sw4936.mrg were used as develop-
ment corpus.
2.1 Prosodic variables
Prosodic information for the corpus was ob-
tained from forced alignments provided by
Hamaker et al. (2003) and Ferrer et al. (2002).
Hamaker et al. (2003) provided word alignments
between the LDC parsed corpus and new alignments
of the Switchboard Coprus. Most of the differences
between the two alignments were individual lexical
items. In cases of differences, we kept the lexical
item from the LDC version. Ferrer et al. (2002)
provided very rich prosodic information including
duration, pausing, f0 information, and individual
speaker statistics for each word in the corpus. The
information obtained from this corpus was aligned
to the LDC corpus.
It is not known exactly which prosodic vari-
ables convey the information about syntactic bound-
aries that is most useful to a modern syntactic
parser, so we investigated many different com-
binations of these variables. We looked for
changes in pitch and duration that we expected
would correspond to syntactic boundaries. While
we tested many combinations of variables, they
were mainly based on the variables PAU DUR N,
NORM LAST RHYME DUR, FOK WRD DIFF MNMN N,
FOK LR MEAN KBASELN and SLOPE MEAN DIFF N in
the data provided by Ferrer et al. (2002).
While Ferrer (2002) should be consulted for full
details, PAU DUR N is pause duration normalized by
the speaker’s mean sentence-internal pause dura-
tion, NORM LAST RHYME DUR is the duration of the
phone minus the mean phone duration normalized
by the standard deviation of the phone duration for
each phone in the rhyme, FOK WRD DIFF MNMN NG
is the log of the mean f0 of the current word,
divided by the log mean f0 of the following
word, normalized by the speakers mean range,
FOK LR MEAN KBASELN is the log of the mean f0
of the word normalized by speaker’s baseline, and
SLOPE MEAN DIFF N is the difference in the f0 slope
normalized by the speaker’s mean f0 slope.
These variables all range over continuous values.
Modern statistical parsing technology has been de-
veloped assuming that all of the input variables are
categorical, and currently our parser can only use
categorical inputs. Given the complexity of the dy-
namic programming algorithms used by the parser,
it would be a major research undertaking to develop
a statistical parser of the same quality as the one
used here that is capable of using both categorical
and continuous variables as input.
In the experiments below we binned the contin-
uous prosodic variables to produce the actual cate-
gorical values used in our experiments. Binning in-
volves a trade-off, as fewer bins involve a loss of
information, whereas a large number of bins splits
the data so  nely that the statistical models used in
the parser fail to generalize. We binned by  rst con-
structing a histogram of each feature’s values, and
divided these values into bins in such a way that each
bin contained the same number of samples. In runs
in which a single feature is the sole prosodic feature
we divided that feature’s values into 10 bins, while
runs in which two or more prosodic features were
conjoined we divided each feature into 5 bins.
While not reported here, we experimented with a
wide variety of different binning strategies, includ-
ing using the bins proposed by Ferrer et al. (2002).
In fact the number of bins used does not affect the
results markedly; we obtained virtually the same re-
sults with only two bins.
We generated and inserted  pseudo-punctuation 
symbols based on these binned values that were in-
serted into the parse input as described below. In
general, a pseudo-punctuation symbol is the con-
junction of the binned values of all of the prosodic
features used in a particular run. When map-
ping from binned prosodic variables to pseudo-
punctuation symbols, some of the binned values
can be represented by the absence of a pseudo-
punctuation symbol.
Because we intend these pseudo-punctuation
symbols to be as similar as possible to normal punc-
tuation, we generated pseudo-punctuation symbols
only when the corresponding prosodic variable falls
outside of its typical values. The ranges are given
below, and were chosen so that they align with
bin boundaries and result in each type of pseudo-
punctuation symbol occuring on 40% of words.
Thus when a prosodic feature is used alone only 4 of
its 10 bins are represented by a pseudo-punctuation
symbol.
However, when two or more types of the prosodic
pseudo-punctuation symbols are used at once there
is a larger number of different pseudo-punctuation
symbols and a greater number of words appear-
ing with a following pseudo-punctuation symbol.
For example, when P, R and S prosodic annota-
tions are used together there are 89 distinct types
of prosodic pseudo-punctuation symbols in our cor-
pus, and 54% of words are followed by a prosodic
pseudo-punctuation symbol.
The experiments below make use of the following
types of pseudo-punctuation symbols, either alone
or concatenated in combination. See Figure 2 for
an example tree with pseudo-punctuation symbols
inserted.
Pb This is based on the bin b of the binned
PAU DUR N value, and is only generated when
the PAU DUR N value is greater than 0.285.
Rb This is based on the bin b of the binned
NORM LAST RHYME DUR value, and is only
generated that value is greater than -0.061.
Wb This is based on the bin b of the binned
FOK WRD DIFF MNMN N value, and is only gen-
erated when that value is less than -0.071 or
greater than 0.0814.
Lb This is based on the bin b of the
FOK LR MEAN KBASELN value, and is only
generated when that value is less than 0.157 or
greater than 0.391.
Sb This is based on the bin b of the
SLOPE MEAN DIFF N value, and is only
generated whenever that value is non-zero.
In addition, we also created a binary version of
the P feature in order to evaluate the effect of bina-
rization.
NP This is based on the PAU DUR N value, and is
only generated when that value is greater than
0.285.
We actually experimented with a much wider
range of binned variables, but they all produced re-
sults similar to those described below.
2.2 Parse corpus construction
We tried to incorporate the binned prosodic informa-
tion described in the previous subsection in a manner
that corresponds as closely as possible to the way
that punctuation is represented in this corpus, be-
cause previous experiments have shown that punc-
tuation improves parser performance (Charniak and
Johnson, 2001; Engel et al., 2002). We deleted dis-
 uency tags and EDITED subtrees from our training
and test corpora.
We investigated several combinations of prosodic
pseudo-punctuation symbols. For each of these we
generated a training and test corpus. The pseudo-
punctuation symbols are dominated by a new preter-
minal PROSODY to produce a well-formed tree.
These prosodic local trees are introduced into the
tree following the word they described, and are at-
tached as high as possible in the tree, just as punc-
tuation is in the Penn treebank. Figure 2 depicts
a typical tree that contains P R S prosodic pseudo-
punctuation symbols inserted following the word
they describe.
We experimented with several other ways of in-
corporating prosody into parse trees, none of which
greatly affected the results. For example, we also ex-
perimented with a  raised representation in which
the prosodic pseudo-punctuation symbol also serves
as the preterminal label. The corresponding  raised 
version of the example tree is depicted in Figure 3.
The motivation for raising is as follows. The sta-
tistical parser used for this research generates the
siblings of a head in a sequential fashion,  rst pre-
dicting the category label of a sibling and later con-
ditioning on that label to predict the remaining sib-
lings.  Raising should permit the generative model
to condition not just on the presence of a prosodic
pseudo-punctuation symbol but also on its actual
identity. If some but not all of the prosodic pseudo-
punctuation symbols were especially indicative of
some aspect of phrase structure, then the  raising 
structures should permit the parsing model to detect
this and condition on just those symbols. Note that
in the Penn treebank annotation scheme, different
types of punctuation are given different preterminal
categories, so punctuation is encoded in the treebank
using a  raised representation.
The resulting corpora contain both prosodic and
punctuation information. We prepared our actual
training and testing corpora by selectively remov-
ing subtrees from these corpora. By removing all
punctuation subtrees we obtain corpora that contain
prosodic information but no punctuation, by remov-
ing all prosodic information we obtain the original
treebank data, and by removing both prosodic and
punctuation subtrees we obtain corpora that contain
neither type of information.
2.3 Evaluation
We trained and evaluated the parser on the various
types of corpora described in the previous section.
S
INTJ
UH
Uh
PROSODY
*R4*
,
,
NP
PRP
I
PROSODY
*R4*
VP
VBP
do
RB
nt
VP
VB
live
PP
IN
in
NP
DT
a
PROSODY
*R3*S2*
NN
house
PROSODY
*S4*
,
,
Figure 2: A tree with P R S prosodic pseudo-punctuation symbols inserted following the words they corre-
spond to. (No P prosodic features occured in this utterance).
S
INTJ
UH
Uh
*R4*
*R4*
,
,
NP
PRP
I
*R4*
*R4*
VP
VBP
do
RB
nt
VP
VB
live
PP
IN
in
NP
DT
a
*R3*S2*
*R3*S2*
NN
house
*S4*
*S4*
,
,
Figure 3: The same sentence as in Figure 2, but with prosodic pseudo-punctuation raised to the preterminal
level.
Annotation unraised raised
punctuation 88.212
none 86.891
L 85.632 85.361
NP 86.633 86.633
P 86.754 86.594
R 86.407 86.288
S 86.424 85.75
W 86.031 85.681
P R 86.405 86.282
P W 86.175 85.713
P S 86.328 85.922
P R S 85.64 84.832
Table 1: The F-score of the parser’s output when
trained and tested on corpora with varying prosodic
pseudo-punctuation symbols. The entry  punc-
tuation gives the parser’s performance on input
with standard punctuation, while  none gives the
parser’s performance on input without any punctua-
tion or prosodic pseudo-punctuation whatsoever.
(We always tested on the type of corpora that corre-
sponded to the training data). We evaluated parser
performance using the methodology described in
Engel et al. (2002), which is a simple adaptation of
the well-known PARSEVAL measures in which punc-
tuation and prosody preterminals are ignored. This
evaluation yields precision, recall and F-score values
for each type of training and test corpora.
3 Results
Table 1 presents the results of our experiments. The
RAISED prosody entry corresponds to the raised ver-
sion of the COMBINED corpora, as described above.
We replicated previous results and showed that
punctuation information does help parsing. How-
ever, none of the experiments with prosodic infor-
mation resulted in improved parsing performance;
indeed, adding prosodic information reduced perfor-
mance by 2 percentage points in some cases. This is
a very large amount by the standards of modern sta-
tistical parsers. Notice that the general trend is that
performance decreases as the amount and complex-
ity of the prosodic annotation increased.
4 Discussion and Conclusion
Simple statistical tests show that there is in fact
a signi cant correlation between the location of
opening and closing phrase boundaries and all of
the prosodic pseudo-punctuation symbols described
above, so there is no doubt that these do con-
vey information about syntactic structure. How-
ever, adding the prosodic pseudo-punctuation sym-
bols uniformly decreased parsing accuracy relative
to input with no prosodic information. There are a
number of reasons why this might be the case.
While we investigated a wide range of prosodic
features, it is possible that different prosodic features
might improve parsing performance, and it would be
interesting to see if improved prosodic feature ex-
traction would improve parsing accuracy.
We suspect that the decrease in accuracy is due
to the fact that the addition of prosodic pseudo-
punctuation symbols effectively excluded other
sources of information from the parser’s statisti-
cal models. For example, as mentioned earlier the
parser uses a mixture of n-gram models to predict
the sequence of categories on the right-hand side
of syntactic rules, backing off ultimately to a dis-
tribution that includes just the head and the preced-
ing sibling’s category. Consider the effect of insert-
ing a prosodic pseudo-punctuation symbol on such
a model. The prosodic pseudo-punctuation symbol
would replace the true preceding sibling’s category
in the model, thus possibly resulting in poorer over-
all performance (note however that the parser also
includes a higher-order backoff distribution in which
the next category is predicted using the preceding
two sibling’s categories, so the true sibling’s cate-
gory would still have some predictive value).
The basic point is that inserting additional in-
formation into the parse tree effectively splits the
conditioning contexts, exacerbating the sparse data
problems that are arguably the bane of all statisti-
cal parsers. Additional information only improves
parsing accuracy if the information it conveys is suf-
 cient to overcome the loss in accuracy incurred by
the increase in data sparseness. It seems that punctu-
ation carries suf cient information to overcome this
loss, but that the prosodic categories we introduced
do not.
It could be that our results re ect the fact that we
are parsing speech transcripts in which the words
(and hence their parts of speech) are very reliably
identi ed, whereas our prosodic features were auto-
matically extracted directly from the speech signal
and hence might be noisier. If the explanation pro-
posed above is correct, it is perhaps not surprising
that an accurate part of speech label would prove
more useful in a conditioning context used by the
parser than a noisy prosodic feature. Note that this
would not be the case when parsing from speech rec-
ognizer output (since word identity would itself be
uncertain), and it is possible that in such applications
prosodic information would be more useful.
Of course, there are many other ways prosodic in-
formation might be exploited in a parser, and one
of those may yield improved parser performance.
We chose to incorporate prosodic information into
our parser in a way that was similar to the way
that punctuation is annotated in the Penn treebanks
because we assumed that punctuation carries infor-
mation similar to prosody, and it had already been
demonstrated that punctuation annotated in the Penn
treebank fashion does systematically improve pars-
ing accuracy.
But the assumption that prosody conveys infor-
mation about syntactic structure in the same way
that punctuation does could be false. It could also be
that even though prosody encodes information about
syntactic structure, this information is encoded in
a manner that is too complicated for our parser to
utilize. For example, even though commas are of-
ten used to indicate pauses, pauses have many other
functions in  uent speech. Pauses of greater than
200 ms are often associated with planning problems,
which might be correlated with syntactic structure
in ways too complex for the parser to exploit. While
not reported here, we tried various techniques to iso-
late different functions of pauses, such as exclud-
ing pauses of greater than 200 ms. However, all of
these experiments produced results similar to those
reported here.
Finally, there is another possible reason why our
assumption that prosody and punctuation are similar
in their information content could be wrong. Our
prosodic information was automatically extracted
from the speech stream, while punctuation was pro-
duced by human annotators who presumably com-
prehended the utterances being annotated. Given
this, it is perhaps no surprise that our automatically
extracted prosodic annotations proved less useful
than human-produced punctuation.
References
Bengt Altenberg. 1987. Prosodic patterns in spoken En-
glish: studies in the correlation between prosody and
grammar. Lund University Press, Lund.
Don Baron, Elizabeth Shriberg, and Andreas Stolcke.
2002. Automatic punctuation and dis uency detec-
tion in multi-party meetings using prosodic and lex-
ical cues. In Proc. Intl. Conf. on Spoken Language
Processing, volume 2, pages 949 952, Denver.
Ann Bies, Mark Ferguson, Karen Katz, and Robert Mac-
Intyre, 1995. Bracketting Guideliness for Treebank II
style Penn Treebank Project. Linguistic Data Consor-
tium.
Eugene Charniak and Mark Johnson. 2001. Edit detec-
tion and parsing for transcribed speech. In Proceed-
ings of the 2nd Meeting of the North American Chap-
ter of the Association for Computational Linguistics,
pages 118 126.
Eugene Charniak. 2000. A maximum-entropy-inspired
parser. In The Proceedings of the North American
Chapter of the Association for Computational Linguis-
tics, pages 132 139.
William Croft. 1995. Intonation units and grammatical
structure. Linguistics, 33:839 882.
Donald Engel, Eugene Charniak, and Mark Johnson.
2002. Parsing and dis uency placement. In Proceed-
ings of the 2002 Conference on Empirical Methods in
Natural Language Processing, pages 49 54.
Luciana Ferrer, Elizabeth Shriberg, and Andreas Stol-
cke. 2002. Is the speaker done yet? faster and more
accurate end-of-utterance detection using prosody in
human-computer dialog. In Proc. Intl. Conf. on Spo-
ken Language Processing, volume 3, pages 2061 
2064, Denver.
Luciana Ferrer. 2002. Prosodic features for the switch-
board database. Technical report, SRI International,
Menlo Park.
Jon Hamaker, Dan Harkins, and Joe Picone. 2003. Man-
ually corrected switchboard word alignments.
Julia Hirschberg and Christine Nakatani. 1998. Acoustic
indicators of topic segmentation. In Proc. Intl. Conf.
on Spoken Language Processing, volume 4, pages
1255 1258, Philadelphia.
Wouter Jansen, Michelle L. Gregory, and Jason M. Bre-
nier. 2001. Prosodic correlates of directly reported
speech: Evidence from conversational speech. In Pro-
ceedings of the ISCA Workshop on Prosody in Speech
Recognition and Understanding, pages 77 80, Red
Banks, NJ.
Mark Johnson. 1998. PCFG models of linguis-
tic tree representations. Computational Linguistics,
24(4):613 632.
Ralf Kompe. 1997. Prosody in speech understanding
systems. Springer, Berlin.
Chris Manning and Hinrich Schcurrency1utze. 1999. Foundations
of Statistical Natural Language Processing. The MIT
Press, Cambridge, Massachusetts.
Heinrich Neiman, Elmar Noth, Anton Batliner, Jan
Buckow, Florian Gallwitz, Richard Huber, and Volkar
Warnke. 1998. Using prosodic cues in spoken dialog
systems. In Proceedings of the International Work-
shop on Speech and Computer, pages 17 28, St. Pe-
tersburg.
Elmar Ncurrency1oth, Anton Batliner, Andreas Kie ling, Ralf
Kompe, and Heinrich Niemann. 2000. Verbmobil:
The use of prosody in the linguistic components of a
speech understanding system. IEEE Transactions on
Speech and Auditory Processing, 8(5):519 532.
Astrid Schepman and Paul Rodway. 2000. Prosody
and on-line parsing in coordination structures. The
Quarterly Journal of Experimental Psychology: A,
53(2):377 396.
Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-
Tur, and Gorkhan Tur. 2000. Prosody-based auto-
matic segmentation of speech into sentences and top-
ics. Speech Communication, 32(1-2):127 154.
