Prosody and the Resolution of Pronominal Anaphora 
Maria Welters 
lnstitut l'i.ir Komlnunikationsforschung und 
Phonetik, Universitfit Bonn 
Poppelsdorfer Alice 47, D-53115 Bonn 
wolters@ikp.uni-bonn, de 
Donna K. Byron 
Department of Computer Science 
University of Rochester 
Re. Box 270226, Rochester, NY 14627 
dbyron@cs, rochester, edu 
Abstract 
In this paper, we investigate the acoustic prosodic mark- 
ing o\[" demonstrative and personal pronouns in task- 
oriented dialog. Although it has been hypothesized that 
acouslie marking affects pronoun resolution, we find 
flint {l~e prosodic information extracted from tile data is 
not sufficienl to predict antcceden! lype reliably. Inter- 
speaker variation accottnts for mt, ch of lhe prosodic vari- 
ation that we find in our data. We conclude that prosodic 
cues shot, ld be handled with care in robust, speaker- 
independenl dialog systems. 
1 Introduction 
Previous work on anaphora resolution has yieMed a rich 
basis of theories and heuristics for finding antecedenls. 
However, most research to date has neglecled an impor- 
tant potential cue that is only available in spoken data: 
prosody. Prosodic marking can be used to change the 
antecedent of a pronoun, as demonsh'ated by lifts clas- 
sic example from l.akoff ( 1971 ) (capitals indicalc a pitch 
accent): 
( I ) Johll i called Jimj a Relmblican, then hei insulted 
himj. 
(2) Johlll called Jim./ a Republican, then lll{j ill- 
sulled lllMi. 
But exactly how the antecedent changes due to the 
prosodic marking on tile pronoun, and whefller this effect 
happens consistently, is an open question. If consislcnl 
elfecls do exisl, they would be useful for online pronoun 
inlerpretation in spoken dialog systems. 
Prosodic prominence directs tile attention of the lis- 
tener to what is important for understanding and inteF 
pretation. But how should this principle be applied when 
words that are normally not very prominent, such as 
prollouns, are accented? More generally, does acous- 
tic marking provide syslemalic cues to characteristics of 
amecedents? IVlore specitically, does it imply that tile 
antecedent is "untmtml" in some wily'? These arc tile 
two hypofl/eses we investigate in this paper. ()ur data 
consists of 322 pronouns from a large corpus of spoma- 
neous lask-orientcd dialog, the TRAINS93 corpus (Hee~ 
man and Allen, 1995). This corpus allows us to study 
pronotms as they occur in spontaneous unscripted dis- 
COLlrse, al)d is erie of tile very few speech corporit to have 
been annotated with pronoun interpretation infommtion. 
The remainder oF this paper is structured as follows: 
In Section 2, we SUllllllill'iZe relevant work on pl'OllOUn 
resolution and report on tile few proposals for integrat- 
ing prosody into pronoun resolution algorithms. Next, 
in Section 3, we present tile dialogs used for our study 
and the attributes awfilable in tile annotation data, while 
Section 4 describes file acoustic measures that were corn- 
puled automatically from the data. Section 5 explores 
whelher there are syslematic correlations between these 
properties and tile acoustic measures fundamental fre- 
quency, duralion, and inlensity. For lhese measures, we 
find Ihat nlost correlations are in fact due to speaker vari- 
alien, and fl/a/ speakers differ greatly in their overall 
prosodic characleristics. Finally, we investigate whether 
it is possible to use Ihesc acoustic features to predict 
prope,'ties of tile antecedent using logistic regression. 
Again, we do not find acoustic features to be reliable 
prediclors for lhe l'catures of inleresl. Therefore, we con- 
chide in Section 6 lhat acoustic measures cannot be used 
in sl)eaker-independenl online anaphora resolution algo- 
rithms to predict lhe features under investigation here. 
2 Background and Related Work 
There is a rich literature ¢)11 resolving personal pronouns. 
Many approaches arc based on a notion of attentional 
foctls. Entities in attentional focus are highly salient, 
and pronouns are assumed to refer to tile most salient 
entity in lhc discourse (el. (Brennan el al., 1987; Az- 
zam et ill., 1998; Strube, 1998)). Centering (Grosz et 
al., 1995) is a i}amework for predicting local attentional 
focus. It assumes that tile most salient entity from sen- 
tence ,3,,_\] that is realized in sentence ,5',, is most likely 
to be pronominalized in ,3,z. That entity is termed the Cb 
(backward-looking center) of sentence ,5',,. Finding ille 
preferred ranking criteria is an active area of research. 
Byron and Stem (1998) adapted this approach, which had 
previously been applied to text, for spoken dialogs, but 
wilh linfited st,ccess. 
\]n contrast to personal pronouns, demonstratives do 
not rely on calculalions of salience. In fact, Linde (1979) 
found lhat while it was preferred for entities within the 
919 
current local t'ocus, that was used for items outside the 
current focus of attention. Passonneau (1989) showed 
that personal and demonstrative pronouns are used in 
contrasting situations: personal pronouns are preferred 
when both the pronoun and its antecedent are in sub- 
ject position, while demonstrative pronouns are preferred 
when either the pronoun or its antecedent is not ill sub- 
ject position. She also found that personal pronouns tend 
to co-specify with pronouns or base noun phrases; the 
more clause- or seutence-likc the antecedent, the more 
likely the speaker is to choose a demonstrative pronoun. 
Pronoun resolution algoritlnns tend not to cover 
demonstratives. Notable exceptions are Webber's model 
for discotn'se deixis (Webbcr, 1991) and the model de- 
veloped for spoken dialog by Eekert and Strube (1999). 
This algorithm encompasses both personal and delnon- 
strative pronouns and exploits their contrastive usage pat- 
terns, relying on syntactic clues and verb subcategoriza- 
tions as input. Neither study investigated the intluence of 
prosodic prominence on resolution. 
Most previous work on prosody and pronotm resolu- 
tion has focussed on pitch accents and third person sin- 
gular pronouns that co-specify with persons. Nakatani 
(1997) examined the antecedents of personal pronouns 
in a 20-minute narrative monologue. She found that pro- 
nouns tend to be accented il' they occur in subject po- 
sition, and if the backward-looking center (Grosz et al., 
1995) was shifted to the referent of that pronoun. She 
then extended this result to a general theory of the in- 
teraction between l)rominencc and discourse structure. 
Cahu (1995)discusses accented prorJouns on the ba- 
sis of a theory about accentual correlates of salience. 
Kamcyama (1998) interprets a pitch accent on pronouns 
in the fl'amework of Ihe alternative semantics (Rooth, 
1992) theory o1' focus. She assumes that all potential an- 
tecedents are stored in a list. Pronouns arc then resolved 
to the most preferred antecedent on that list which is syn- 
tactically and semantically compatible with the pronoun. 
Preference is modeled by an ordering on the set ol' an- 
tecedents. An accent on lhe pronoun signals that pro- 
noun resolution should not be based on the default order- 
ing, where the default is computed by a nmnber of in- 
teracting syntactic, semantic, pragmatic, and attentional 
constraints. 
Compared to he and she, it and that lmve been some- 
what neglected. There are two reasons for this: First, it 
is not considered to be as accentable as he and she by 
native speakers of both British and American English, 
whereas that is more likely than it to beat" a pitch ac- 
cent. An informal study of the London-Lund corptts of 
spoken British English (Svartvik, 1990) confirmed that 
observation. Second, that fi'cquently does not lmve a 
co-specifying NP antecedent, and most research on co- 
speciticatiou has focussed on pronouns and NPs. Work 
on accented demonstratives and pronoun resolution is ex- 
tremely scarce. Pioneering studies were conducted by 
Ft'ethcim and his collaborators. They tested the effect of 
accented sentence-initial demonstratives that co-specify 
with the preceding sentence on the resolution of ambigu- 
ous personal pronouns, and found that the pronoun an- 
tecedents switched when the demonstrative was accented 
(Fretheim ct al., 1997). However, to otu" knowledge, 
there are no studies that compare the co-specification 
preferences of accented vs. unaccented demonstratives. 
3 The Corpus: TRAINS93 
Our data is taken from the TRAINS93 corpus of hunlun- 
human problem solving dialogs in the logistics phnuting 
domain. In these dialogs, one participant plays the role 
of the planning assistant and the other attempts to con- 
struct a plan for delivering specified cargo to its destina- 
tion. We used a subset of 18 TRAINS93 dialogs in which 
the referent and antecedent of third-person non-gendcrcd 
pronouns I had been attnotated in a previous study (By- 
ron and Allen, 1998). In the dialogs used for the present 
study, 322 pronouns (158 personal and 164 demonstra- 
live) have been annotated. Personal pronouns ill the di- 
alogs are it, its, itselJ; them, the3,, their and themselves. 
Demonstrative pronouns in the annotation data are that, 
this, these, those. There are live nmle and 11 fenmle 
speakers. One female speaker contributed 89 pronouns, 
two others produced more than 30 each (one female, one 
male), the rest is divided unevenly among tile remain- 
ing 13 speakers. The set of dialogs chosen for annota- 
tion intentionally included a variety of speakers so that 
no speaker's idiosyncratic discourse strategies would be 
prevalent ill the resulting data. 
Table 1 describes the attributes caplurcd for each 
pronoun. These features were chosen for tile annota- 
tion because many previous studies have shown them 
to be imporlant for pronoun resolution. Features ill- 
clude attributes of the pronoun, its antecedent (the dis- 
cotu'se constituent Ihat previously triggered lhe refer- 
ent), and its referent (the entity that should be substi- 
tuted for the pronoun in a semantic representation of 
the sentence). Cb was annotated using Model3 from 
(Byron and Stent, 1998) with a linear model of dis- 
course structure. Note that annolaled prononns were 
not limited to those with NP antecedents, as is tile case 
with most other studies. In addition to NP antecedents, 
pronouns in this data set could have an antecedent of 
some other phrase or clause type, or no annomtablc an- 
tecedent at all. There are two categories of pronouns 
with no annotalable antecedent. Ill the simplest case, 
tim pronominal reference is the first mention of the ref- 
erent ill tile dialog. That happens when the referent is in- 
ferred liom the problem solving state. For example, af- 
ter" tile utterance send the engine to Coming 
and pick up the boxcars, a new discourse en- 
I No gendcred entities exist in this co,'pus, so gendered pronouns 
wc,-c not inchtdcd. All dcmonst,'ativc pronouns were annolated; how- 
evcf, lhcre were only 5 occurrences of "this" in the selected dialogs, 
so eonstrasts between proxinml and distal dcmonslratives could not be 
studied. 
920 
Feature 11) l)escriplion 
I'RONTYPE Pronoutl Type 
I'RONSUB,I Pronoun is suljccl 
ANTI,\]I~()I{M Antecedenl form 
I)IST I)islance to antecedent 
ANTESUILI Antecedent is subjcc! 
CB Backward-looking center 
|trOllOU 11 
category 
Possible Values 
def= tile pronoun is one of {it, its, itself, them, dmy, thcin themselves} 
dcm = the inonoun is one of {that, this, these, fllose} 
Y = prOllOtltl subject of lllaill clause of its ullerance 
N : pronotm not subject of main clause 
I'I~,()NOUN = antecedent is pronoun 
NI' = antecedent is tmse noun phrase 
N()N-NP = antecedent is other constituent, at most one utterance long 
NONE = pronotm is lit'st mention or antecedent length > one tttterance 
SAME = antecedent and pronoun in same utterance 
AI)J = antecedent and pronoun in adjacent utterances 
RI{MOTE = antecedent more than one utterance before pronoun 
Y = alSteccdelll subject o1' the lllain chmse of its tttterance 
N = antecedent not subject of a main clause 
Y = pronoun is Cb of its utterance 
N = pronoun is not Cb 
DIST 
cldj. I'CqllO\[¢ 
Table 1: The features avaihtble ill the annotation data set. 
ANTE ANTESUBJ 
NP/pmn. non-NP none yes no same 
75.9% 6.3% 17.8 % 37.3% 62.7% 29.1% 
28.0% 36.1t0% 36.0% 14.0% 86.11% 18.9% 
51.60{, 21.4% 27.0% 25.5% 74.5% 23.9% 
personal 33.5% 20.2% 
demonslrafive 29.9% 15.2% 
lolal 31.7% 17.7% 
3hble 2: Typical properties of antccedcnts lbr personal and demonst,'ative pronouns ill file corpus. All percentages 
are given relative to tile lolal ntnnber of pronouns in that category and rounded. Boldface: most frequent antecedent 
property. 
tity, tile train composed of tile engine and Ix)xcars, is 
awfilable for anaphoric reference. In the more subtle 
case, Ihe entity was built from a stretch (51" discourse 
longer than one utterance. In an effort to achieve an ac- 
ceptable level of inier-annotalor agreelnenl for the aw 
nohltion, the maxinmm size \[or a consfiluenl to serve as 
~tll ~ltllecedelll W\[lS de\[illed l(1 be OllC ullCl'~,lllCC, l)iscourse 
entities that are built fi'om longer she/chcs of lexl include 
objects such as tile entire 131an or tile discourse itself, and 
such items are less reliable lo annotate. 
qaking the annotated dialogs as a whole, 21.4% of all 
prollouns have ;.l non-NP antecedent, and 27% do not 
have an mmolatal~le antecedent at a11. qhble 2 shows thal 
tile default antecedenls o1' personal and denlonsh'alive 
pronouns follow the predictions of Schiffman (1985). 
The antecedent of personal pronouns is most likely itself 
lo be a pronoun or a full NP, while demonstratives m'e 
most likely to have no antecedent, or if there is one, it is 
ntost likely to be a non-NR The main role of prosodic ill- 
lksr,nation is to help pronoun resolution algorithms iden- 
tify cases where flmse default predictions are false. 
4 Acoustic Prosodic Cues 
Our selection (51' acottstic measures covers three classic 
components of prosody: fundamental frequency (IV()), 
duration, and intensity (Lehiste, 1970). The relation- 
ship between those cues and prosodic pronlinencc has 
been demonstrated by e.g. (Fant and Kruckenberg, 1989; 
Heufl, 1999). Tile main correlate of English stress is F0, 
the second rues! imporlant is duration, and the least im- 
porlanl is inlensity (1,chisle, 1970). Therefore, we will 
pay more allelllioll lo F0 illeflsUl'eS. Although cxperi- 
menial results indicate flint 1;0 cues of pronlinencc can 
depend on the shape of file 1:0 conlour of the uucranec 
(c.f. (Gussenhoven cl al., 1997)), we do nol control for 
such illleraclions. \]llstead, we reslricl ourselves to cues 
that are easy to COnlpute fr(ml limiled dala, so that a run- 
ning spoken dialogue system might be able to compute 
them in real time. 
4.1 Acoustic Measures 
Duration: For duration, we found lhat 1he logarith- 
mic duration wllues a,'c nornmlly distributed, bolh pooled 
over all speakers and for lhoso speakers willl more than 
20 pronouns. Logariflmtic duration is also tile target vari- 
able of many duration models such as that of (van San- 
ten, 1992). We assume that speaker-related variation is 
covered by the w,'iance of lhis normal distribution; we 
can control for speaker effects by including a SPEAKER 
factor in our models. 
F0 variables: F0 was computed using the \]2ntropic 
ESPS Waves tool get_f0 with standard settings and a 
frame rate (51' 10 ms. All F0 wdues were transt'onned into 
lhe log-domain and then pooled imo mean, minimum, 
and maximum F0 values for each word and each utter- 
ance. This log donmin is well motiw~led psychoacousti- 
cally (Zwicker and lhtstl, 1990). F0 range was computed 
oil the values in tile log-domain. We assume lhat the Iog- 
m'ithm of F0 has a nomml distribution. Therefore, we 
921 
can nommlize for speaker-dependent differences in pitch 
range by using z-scores, and we can use standard statis- 
tical analysis methods such as ANOVA. 
Intensity: Intensity is measured as the root-mean- 
square (RMS) of signal amplitudes. We measure 
RMS relative to a baseline as given by the formula 
log(l{MS/RMSb~olino). The baseline RMS was com- 
puted on the basis of a simple pause detection algorithm, 
which takes the first nmximum in the amplitude his- 
togram to be the average amplitude of background noise. 
The baseline RMS was slightly above that value. 
4.2 Inter-Speaker Differences 
Since we need to pool data from many different speak- 
ers, we qeed to control for inter-speaker differences. 
Tim number of pronouns we have fl'om each speaker 
varies between 1 for speaker GD and 86 for speaker 
CK. Speakers PH, male, and CK, female, are the 
only ones to lmve produced more than 15 personal 
pronouns and 15 demonstratives. In order to test 
whether the SPEAKER factor affects the choice be- 
tween personal pronouns and demonstratives, we tit- 
ted a logistic regression model with the target variable 
PRONTYPE (personal or demonstrative) and the predic- 
torsANTE, ANTESUBJ, DIST, REFCAT, CBand 
SPEAKER (in this sequence). REFCAT is an additional 
variable that describes the senmntic category of a pro- 
noun's referent (eg. donmin objects vs. abstract enti- 
ties). Even though SPEAKER is the last factor in the 
model, an analysis of deviance shows a signilicant intlu- 
euce (p<0.005,F=2.51,df13). A possible explanation 
for this is that some speakers prefer to use demonstra- 
tives in contexts where others would choose a personal 
pronotm, and vice versa, or perhaps the SPEAKER vari- 
able mediates the intluence of a far ,nore complex factor 
such as problem solving strategy. Resolving this ques- 
lion is beyond the scope of this paper. 
On the basis of F0, we can establish four groups of 
speakers: The first group consists of male speakers with 
a low mean F0 and a low F0 range. In the next group, 
we find both male and female speakers with a low mean 
F0, but a far higher range. Speaker PH belongs to this 
second group. Interestingly, for these speakers, the mean 
F0 on pronouns is lower titan for those of the first group. 
Groups 3 and 4 consist entirely of female speakers, with 
group 3 using a lower range than group 4. Speaker CK 
belongs to group 4. 
5 Exploring Prominent Pronouns 
If data about prosodic prominence is to be useful for pro- 
noun resolution, then there must be prosodic cues that 
carry information about properties of the antecedent. In 
this section, we investigate if there are such cues for the 
properties that we have available in the annotation data, 
defined in ~lable 1. More specitieally, we hypothesize 
that prosodic cues will be used if the antecedent is some- 
what unusual. For example, the results of Linde and 
Property 
ANTEFORFI 
DIST 
ANTESUBJ 
df 
all 
3 range 
3 none 
2 dur 
Dam Set 
peJw. dem. CK 
110110 none none 
llono none none 
dur, nolle \])ors.: 
mean energy 
range 
Table 3: Significant Inlluences of Antecedent Proper- 
ties (p <0.05) on Prosodic Cues. inean=z-score mean 
F0, range=range of z-score F0, dur=logarithmic dura- 
tion, dem=demonstratives, pets=personal pronotms 
Passonneau would lead us to expect that personal pro- 
nouns with non-NP antecedents and demonstratives with 
NP and pronoun antecedents will be marked. Since the 
antecedents of pronouns tend to occur no more than 1-2 
clauses ago, we would also expect pronouns with more 
remote antecedents to be marked. A first qualitative look 
at the data suggets that even il' such these tendencies are 
present in the data, they might not turn out to be signifi- 
cant. For example, in Figure 1, the means of lzmeanf0 
behave roughly as predicted, but the variation is so large 
that these differences might well be due to chance. 
5.1 Correlations between Measures and Properties 
Next, we examine whether the measures delined in Sec- 
lion 4 correlate with any particular properties o1' the 
antecedent. More precisely, if a property is cued by 
some aspect ot' prosody (either duration, F0, or inten- 
sity), then the prosody of a pronoun depends to a cer- 
lain degree on its antecedent. In a statistical analysis, 
we should lind a significant effect of the relevant an- 
tecedent property on the prosodic measure. We selected 
ANOVA as our analysis method, because our prosodic 
target variables appear to have a normal distribution. For 
each of the antecedent features delined above, we ex- 
amined its inlluence on mean F0 (imeanf0), the z- 
score of mean F0 (lzmeanf0), the z-score of F0 range 
(lzrgf0), logarithmic duration (dur), and normalized 
energy (energy). In addition, we added the tactors, 
PRONTYPE and SPEAKER. 
Results: The results are summarized in Table 3. For 
izmeanf0 and energy, the influence of SPEAKER 
is always considerable. There are also consistent ef- 
fects of the syntactic position of a pronoun: In general, 
demonstratives are shorter in subject position, and for 
CK, mean F0 on personal pronouns in subject position 
is higher than on non-subject ones (228 Hz vs. 190 Hz). 
But when we turn to the factors that interest us lnOSt, 
properties of the antecedent, we cannot lind any consis- 
tent correlates, although in ahnost every data set, there 
are some prosodic cues to ANTESUBJ for personal pro- 
nouns. But what these cues are may well depend on the 
speaker, as the results for CK show. Her pitch range on 
pronouns with a stdjcct antecedent is double the range 
on pronouns with an antecedent in non-su/lject position. 
922 
E 
P¢l'SOlllll lirollOUllS 
T --:- 
--7 : : 
" L ~ 
o ~ 
o o o° 
o 
I I I ~-- 
NI I lie alllt~ non-NP I/lO 
"\['ype of AtllCci~dt~lll 
F~ 
7 
t ) I_qll O II,"i \[ fat i'¢ (~ PfoIIOIln~ 
: 8 
NP IIO1|11\[~ noII-NP pfO 
Type of Alltcccdcnl 
y~ 
K 
8 
o 
-- f 
t~on slkhi 
Persol|lll PfollOtlllS 
- \]-- 
1 ; 
o 
o 
o 
l-- I 
II(II1L~ stilt 
Ai/tecctletll is Stlbjc'ct 
E o 
i 
Demonstrative Pronouns 
m 
Cl 
fi, 8 
t --\]--- i 
I|OII-SlIIj Ill)lie sulj 
AiItecctlct~l is Subject 
Figure 1: Distribution el: z-score of mean F0 for dilferellt values of ANTEFORM and ANTESUI3J 
Pronouns with subject antecedents are also considerably 
louder. All ill all, antecedent prol)ertics can only ac- 
COUllt for a very small percelltage of tile wtriatioll in 
these prosodic cues. Therefore, we should i~ot expect the 
prosodic cues to be slablc, robust indicators for predict- 
ins antecedent properlies ill spoken dialog systems. 
5.2 Inter-Speaker Variation 
we have sccn that inter-speaker di ffcrcl~ces cxpl;~i n much 
of the variation in the prosodic measures. Table 4 gives 
an idea of the size and direction of these differences. 
On the complete data set, wc lilKl that personal pro- 
nouns are shorlor lhan demonslratives, they have a lower 
intensity and show a higher average 1;0 (3~tble 4). A 
closer examination reveals considerable inter-speaker 
variation in the data, illustrated in Table 4. CK is fairly 
ptototypical. PH barely shows the difference il~ F0, al~d 
for MF, the difference in intensity is actually reversed. 
MF also has rather shor! demonstratives. Such speaker- 
specilic wlriation callnot be eliminated by nomtalization. 
It has to be controlled for in the statistical lcsls. Dis- 
covering types of speakers is diflicult - two of the 15 
speakers, CK, and PH, con/ribute 48% of all pronouns. 
5.3 Predicting Properties of tile Antecedent 
Finally, we examine how much information prosodic 
cues yield about the ~tntecedent. For this purpose, we 
set till a prediction lask not unlike one that all actual 
NLU syslenl l~lces. The input variables arc the prosodic 
properties of the pronoun, whether the protloun is per- 
sonal or demonstrative (P\]R.ONTYPE), whether it is the 
subject (PRONSUBJ), and whether it is sentence-initial 
(PRONZNIT). From this, we now have to deduce l~roper - 
lies of thc antecedent: syntactic i'olc (ANTESrdBJ), fern1 
(ANTEFORM), and distance (DZST). For prediction, wc 
used logistic regression (Agresti, 1990). This has two ad- 
vantages: not only can wc compare how well the differ- 
cnt regression models lit the data, wc call also re-analyze 
the titled model to determine which factors have a signif- 
icant inlluence oll classiIication accuracy. 
Firsl, we conslrucl a model on the basis of 
PRONTYPE, PRONSUBJ, and PRONINIT. Then, 
we conslruct a model with these three faclors plus 
SPEAKER.. finally, we train a model with PRONTYPE, 
923 
Speaker 
dis'c. 
all 156 Hz 
CK 188Hz 
PH 126 Hz 
MF 166 Hz 
mean F0 
pelw. dem. 
157 Hz 142 Hz 
208 Hz 187 Hz 
109 Hz 110 Hz 
184 Hz 182 Hz 
z-score mean 
pets. dent 
-0.04 -0.24 
0.31 0.00 
-0.43 -0.47 
0.32 0.26 
duration 
pelw. dem 
161 ms 206 ms 
151 ms 193 ms 
179ms 252 ms 
166 ms 164 ms 
intensity 
petw. dem 
2.36 2.38 
2.51 2.54 
2.57 2.84 
2.69 2.40 
Table 4: Inter-speaker variation in prosody, disc.: complete discourse. All speakers: 322 pronouns, CK: 41 personal, 
45 demonstrative, PH: 18 personal, 24 demonstrative, MF: 7 personal, 8 demonstrative 
PRONSUBJ, PRONINIT, SPEAKER and one of the 
three measures lzmeanf0, dur, energy. The mod- 
els are trained to predict whether there is an antecedent 
(task noAnte), whether the antecedent is a non-NP 
(task nonNP), whether the antecedent is remote (task 
remote), whether the antecedent is in subject position 
(task u j ante), and whether the antecedent is the current 
Cb (task cb). All models are computed over the full data 
set, because the data set for speaker CK is not suflicient 
• for estimating the regression coefficients. The models 
are then compared to see which step yielded a significant 
improvement: adding SPEAKER or adding the prosodic 
variable after we have accounted for SPEAKER variation. 
Results: The results arc summarized in Table 5. On 
all tasks except remote, PRONTYPE and PRONSUBJ 
performed well. Both features have ah'oady been shown 
to be reliable cnes for prononn resoluti(m (c.f. Sec- 
tion 2). On task cb, only PRONTYPE can explain a 
signilicant amount of wuiation. Models which include 
a speaker factor ahnost always fare better. In models 
without speaker information, F0-relaled measures yield 
a larger reduction in deviance than the duration measure. 
The reason for this is that the F0 measures preserve some 
information about the ditl'ercnt speaker strategies. Once 
SPEAKER has been included as well, only dur leads 
to significant improvements on task nonNP (p<0.05). 
Both demonstratives and personal pronouns are shorter 
when the antecedent is a non-NR 
6 Conclusion and Outlook 
In this paper, we cxamincd patterns of acoustic prosodic 
highlighting of personal and demonstrative pronouns in 
a corpus of task-oriented spontaneous dialog. To our 
knowledge, this is the lirst comparative study of this 
kind. Wc used a straightforward, theory-neutral opera- 
tionalization of "prosodic highlighting" that does not de- 
pend on complex algorithms for F0 stylization or (focal) 
accent detection and is thus very easy to incorporate into 
any real-time spoken dialog system. We chose a spo- 
ken dialog corpus that includes demonstrativc pronouns 
because demonstratives are both a prominent feature of 
problem-solving dialogs and a sorely neglected lield of 
study. In particular, we asked two questions: 
Do Speakers Signal Antecedent Properties 
Acoustically? Based on our data, the answer to this 
question is: If they do,/hey do it in a highly idiosyncratic 
way. We cannot posit any safe generalizations over sev- 
eral speakers, and li"om the perspective of an NLP appli- 
cation, such generalizations might even be dangerous. In 
order to evaluate the impact of speaker strategies on the 
resolution of pronouns, we need more data - 150 to 200 
pronouns from 4-5 speakers each. Collecting this amount 
of data in a dedicated corpus is inefficient. Therefore, 
further acoustic investigations do not make much sense at 
this point; rather, the data should be examined carefully 
for tendencies which can form the basis for dedicated 
production and perception experiments which arc explic- 
itly designed for uncovering inter-speaker variation. 
Are Acoustic Features Useful for Pronoun 
Resolution? The answer is: probably not. At least for 
this corpus, we were not able to determine any numeri- 
cal heuristics that could be utilized to aid pronoun reso- 
lution. The logistic regression experiments show that on 
a speaker-independent basis, logarithmic duration might 
well be a reliable cue to certain aspects of a pronoun's 
antecedent. In order to incorporate prosodic cues into 
an actual algorithm, we will need more training material 
and a principled evaluation procedure. We will also need 
to take into account other influences, such as dialog acts 
and dialog structure. 
Acknowledgements. Wc would like to thank the three 
anonymous reviewers, Rebecca Passonneau, Lncicn 
Galescu, James Alhm, Michael Strube, Dictmar Lancd 
and Wolf gang Hess for their comments on earlier vet'- 
sions of this work. Donna K. Byron was funded by ONR 
research grant N00014-95-1-1088 and Columbia Univer- 
sity/NSF research grant OPG:1307. For all statistical 
analyses, wc used R (Ihaka and Gentleman, 1996). 
924 
\]hsk 
nonNP 
noAnte 
remote 
sjante 
cb 
significant illfluence 
PRONTYPE, PRONSUBJ, PRONINIT, dur 
PRONTYPE, PRONSUBJ, PRONINIT, SPEAKER. 
110110 
PRONTYPE, PRONSUBJ 
PRONTYPE, SPEAKER 
Table 5: Perlbrmance of Reg,'ession Models on Tasks. Listed are factors which improve perfornmnce signilicantly 
(p < 0.05) 

References 

A. Agresti. 1990. Categorical Data Analysis. John Wi- 
ley. 

S. Azzam, K. Humphreys, and R. Gaizauskas 1998. 
E×tending a Simple Co,'eference Algorithm with a 
Focusing Mechanism. In New Approaches to Dis- 
course Anaphora: Proceedings of the Second Collo- 
quium on Discoulwe Anaphora and Anaphor Resolu- 
tion (DAARC2), pages 15-27. 

S. Brennan, M. Friedman, and C. Pollard. 1987. A cen- 
tering approach to pronouns. In Proceedings of the 
25 th Ammal Meeting of the Association.fi~r Compu- 
tational Linguistics (ACL '87), pages 155-162. 

D. Byron and J. Allen. 1998. Resolving demonstra- 
tive pronouns in the TRAtNS93 corpus. In New Ap- 
proaches to Discoume AnaFhora: Proceedings of 
the Second Colloquium on Discou/we AnaFhora attd 
Anaphor Resolution (DAARC2), pages 68 - 81. 

I). Byron and A. Stem. 1998. A preliminary model of 
centering in dialog. In Proceedings of tire 36 th An- 
total Meeting of the Association .for Computational 
Linguistics (A CL '98). 

3. Cahn. 1995. The effect of pitch accenting on pro- 
norm referent resolulion. In Proceedings of the 33 tj~ 
Ammal Meeting of" the Association./or Computatiomd 
£ingtdstics (ACL '95), pages 290-292. 

M. F, ckert and M. Strubc. 1999. Resolving discourse de- 
ictic anaphora in dialogs. 111 I~roceedings oJ" the 9 th 
Coq/'erence of the European Chapter of the Associa- 
tion Jbr Conq)utational Linguistics ( I';ACL '99). 

G. Fant and A. Kruckenberg. 1989. Preliminaries 
to tim study of Swedish prose reading and reading 
style. KT'II Speech 7)ansmission Laborato O, Quar- 
terly Progress and Status Report, 2: 1-83. 

T Frelheinl, W. wm 1)onmlelen, and K. 13orthen. 1997. 
lJnguislic constraints on relevance in reference reso- 
lution. In K. Singer, R. Eggert, and G. Anderson, edi- 
tors, CLS, volume 33, pages 99-113. 

B. Grosz, A. Joshi, and S. Weinstein. 1995. Cenlering: 
A framework for modeling the local coherence of dis- 
course. ComputationalLinguistics, 21(2):203-226. 

C. Gussenhoven, B.H. Repp, A. Rietveld, II. P, ump 
and J. Terken. 1997. The perceptual prominence of 
t'undanmntal frequency peaks. J. Acoust. Soc. Ame,:, 
102:3009-3022. 

P. Heeman and J. Allen. 1995. The Trains Spoken l)ia- 
log Corpus. CD-ROM, lJngt, istic Data Consortium. 

B. Heuft 1999. F, ine prominenzbasierte Methode zttt 
Prosodieanalyse trod -synthese. Peter Lang, Frank- 
furt. 

R. Ihaka and R. Gentlenmn (1996). R: A  for 
data analysis and graphics. Journal q/Co/nputational 
and Graphical Statistics, 5:299-314. 

M. Kameymna. 1998. Stressed Pronouns. 111 R Bosch, 
R. van Sandt, editors, The Focus Book, pages 89-112. 
()xford University Press, Oxford. 

G. lmkoff. 1971. P,'esuppositions and relative well- 
formedness. Iii Semantics: An InteMisciplinao, 
Reader in Philosophy, Linguistics, and l'sydtology, 
pages 329-340. Cambridge University Press. 

I. Lchiste. 1970. Suprasegmentals. Mrl" Press, Cam- 
bridge, Mass. 

C Limle. 1979. Focus o1' attention and the choice of 
pronouns in discourse. In qhhny Given, editor, &,max 
and Semantics 12: Discomwe arid ,S),ntax, New York. 
Academic Press. 

C. Nakatani. 1997. The Computational Processing of 
lntonatiolml Prominence: A Functional Prosody Per- 
spective. Ph.l). thesis, Harvard University. 

R. Passonneau. 1989. Gelling al discourse referents. In 
Proceedings of the 27 u' Ammal Meeting of the Associ- 
ation for Computational Linguistics (ACL '89), pages 
51-59. 

M. Rooth. 1992. A theory of focus interpretation. Natu- 
ral Language Semantics, 1:75-112. 

P,. Schiffnlan (Passo,mcau). 1985. DA'course con- 
strair2ts on 'it' and 'that': A study of la/tguage use in 
career-courtseling interviews. Ph.\]). thesis, Universily 
of Chicago. 

\]. Swulvik, editor. 1990. 7he London Coq)us o.fSl~oken 
English: Descril)tion and Reseamh. l.und Universily 
Press, Lund. 

M. Strube 1998. Never look back: An alternative to 
centering. In Proceedings (!/" the 36 th Amlual Meet- 
ing o.f the Association for Comtmtational Linguistics 
(ACL '98), pages 1251-1257. 

J. van Sanlen 1992 Contexlt, al effects on vowel dura- 
tion. Speech Co/mmmication, 11:513-546. 

B. Webber. 1991. Structure and ostension in the inter- 
pretation of discourse deixis. Language and Cognitive 
Processes, 6:107-135. 

E. Zwicker and H. Fastl 1990. Psychoacoustics. 
Springer, Be,'lin. 
