Issues in the Transcription of English Conversational Grunts 
Nigel Ward 
Mech-Info Engineering, University of Tokyo,
Bunkyo-ku, Tokyo 113-8656, Japan 
nigel@sanpo.t.u-tokyo.ac.jp 
http://www.sanpo.t.u-tokyo.ac.jp/~nigel/
Abstract 
Conversational grunts, such as uh-huh, un-hn, mm, and oh, are
ubiquitous in spoken English, but no satisfactory scheme for
transcribing these items exists. This paper describes previous
approaches, presents some facts about the phonetics of grunts,
proposes a transcription scheme, and evaluates its accuracy. 1
1 The Importance of 
Conversational Grunts 
Conversational grunts, such as uh-huh, un-hn, mm, and oh are
ubiquitous in spoken English. In our conversation data, these
grunts occur an average of once every 5 seconds in American
English conversation. In a sample of 79 conversations from a
larger corpus, Switchboard, um was the 6th most frequent item
(after I, and, the, you, and a), and the four items uh, uh-huh,
um and um-hum accounted for 4% of the total. These sounds are
not only frequent, they are important in language use. To
mention just one example, people learning English as a second
language are handicapped in informal interactions if they cannot
produce and recognize these sounds.
1 I would like to thank Takeki Kamiyama for phonetic label
cross-checking, all those who let me record their conversations,
and the anonymous referees; and also the Japanese Ministry of
Education, the Sound Technology Promotion Foundation, the
Nakayama Foundation, the Inamori Foundation, the International
Communications Foundation and the Okawa Foundation for support.
Just to be clear about definitions, in this paper 'grunts' 2
means sounds which are 'not words', where a prototypical "word"
is a sound having 1. a clear meaning, 2. the ability to
participate in syntactic constructions,
and 3. a phonotactically normal pronuncia- 
tion. For example, uh-huh is a grunt since it 
has no referential meaning, has no syntactic 
affinities, and has salient breathiness. In this 
paper 'conversational' refers to sounds which 
occur in conversation and are at least in part 
directed at the interlocutor, rather than be- 
ing purely self-directed 3. Both of these defi- 
nitions have flaws, but they provide a fairly 
objective criterion for delimiting the set of 
items which any transcription scheme should 
be able to handle. 
The phenomena circumscribed by this def- 
inition are a subset of "vocal segregates" 
(Trager, 1958) and of "interjections": the dif- 
ference is that it limits attention to sounds 
occurring in conversations. This definition 
also roughly delimits the subset of "discourse 
markers" or "discourse particles" which occur 
in informal spoken discourse. 
As the phonetics and meanings of conver- 
sational grunts are currently not well under- 
stood, we have begun a project aiming to elu- 
cidate, model, and eventually exploit them. 
The current paper is a report on an approach to the preliminary
problem of how to transcribe these sounds.
2 It may seem that the negative connotations of the word 'grunt'
make it inappropriate for use as a technical term, but the
phenomenon itself is often stigmatised, and so the term is
appropriate in that sense too.
3 Two rules of thumb were adopted to help in cases which were
difficult to judge: consider laughter as not conversational, and
consider as conversational everything else that might possibly
be playing some communicative role, even if it isn't clear what
that role might be.
A generally usable, standardized transcription scheme would be
of great value. Immediate applications include screenplay
writing and court recording. It would also facilitate the
systematic corpus-based study of the meanings and functions of
these sounds 4.
There are also prospects for applications in 
systems. One could imagine a dialog tran- 
scription system that produces output with 
the grunts represented in enough detail to 
show whether a listener is being enthusias- 
tic, reluctant, non-committal, bored, etc., as 
these states are often indicated by grunts 
rather than by words. One could imagine 
spoken dialog systems which prompt and con- 
firm concisely with such grunts, instead of full 
words or phrases. And one could imagine spo- 
ken dialog systems which adjust their output 
based on barge-in feedback from the user such 
as uh-huh meaning "go on, don't talk so slow", 
uh-hum meaning "stop, I need to think", and 
ah meaning "I have something to say". 
Section 2 surveys previous approaches to 
grunt transcription, Section 3 proposes a 
slightly new scheme, Section 4 discusses its 
adequacy, and Section 5 points out some open 
issues. 
2 Previous Schemes for Grunt 
Transcription 
This section points out the problems with previous approaches
to grunt transcription.
2.1 Phonetically Accurate Schemes 
One tradition in labeling grunts is to use a completely general
scheme. The central inspiration here is the fact that grunts are
unlike words, in that they contain sounds which are never seen
in the lexical items of the language. As such, they can fall
outside the coverage of even the International Phonetic
Alphabet, which is only designed to handle those sounds which
occur contrastively in some words in some language. Thus there
have been proposals for richer, more complete transcription
schemes, capable of handling just about any communicative noise
that people have been observed to produce, including moans,
cries and belches (Trager, 1958; Poyatos, 1975).
4 This is not to say that there can be a strict ordering of
activities here: on the contrary, it is not possible to fix a
transcription standard without at least a tacit theory of the
meanings and functions of the items being transcribed. Some
thoughts on this appear elsewhere (Ward, 2000).
One disadvantage of these notations is that 
they are not usable without training. 
A second disadvantage is that their gener- 
ality is excessive for everyday use. As seen 
below, the vast majority of conversational 
grunts are drawn from a much smaller inven- 
tory of sounds. 
A third disadvantage is that they provide 
more accuracy than is needed. For exam- 
ple, in English there appear to be no grunts 
in which the difference between an alveolar 
nasal, a velar nasal, or nasalization of a vowel 
conveys a difference in meaning, and so these 
do not need to be distinguished in transcrip- 
tion. 
2.2 Function-based Schemes
An alternative approach is seen in some 
schemes used for labeling corpora for pur- 
poses of training and evaluating speech rec- 
ognizers. A quote from the most recent 
Switchboard labeling standard (Hamaker et 
al., 1998) gives the flavor: 
20. Hesitation Sounds: Use "uh" or "ah" for hesitations
consisting of a vowel sound, and "um" or "hm" for hesitations
with a nasal sound, depending upon which transcription the
actual sound is closest to. Use "huh" for aspirated version of
the hesitation as in "huh? <other speaker responds> um ok, I see
your point."
21: yes/no sounds: Use "uh-huh" or "um-hum" (yes) and "huh-uh"
or "hum-um" (no) for anything remotely resembling these sounds
of assent or denial.
Another scheme (Lander, 1996) lists several 
"miscellaneous words", including: 
"nuh uh" (no), "mm hmm" (yes), "hmm mmm" (no), "mm mm" (no),
"uh huh" (yes), "huh uh" (no), "uh uh" (no)
The inspiration behind these schemes 
seems to be the idea that grunts are just like 
words. This leads to two assumptions, both 
of which are questionable. First, there is the 
assumption that each grunt has some fixed 
meaning and some fixed functional role (filler, 
back-channel, etc). However, many specific 
grunt sounds can be found in more than one 
functional role, as seen in Table 1. Second, 
there is the assumption that the set of conversational grunts is
small. However the number of observed grunts is not small, as
seen in Table 2, and the set of possible grunts is probably not
even finite: for example, it would not be surprising at all to
hear the sound hum-ha-han in conversation, or hem-ha-an, or
hum-ha-un, and so on, and so on. (However,
not every possible sound seems likely to be a 
conversational grunt; for example ziflug would 
seem a surprising novelty, and would be down- 
right weird in any of the functional positions 
typical for grunts.) 
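The first point, that one grunt can fill several functional roles, can be checked on any role-labeled corpus by a simple cross-tabulation. The following is a minimal Python sketch, not the procedure used to build Table 1; the function name, the (grunt, role) pair format, and the toy tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def multi_role_grunts(tokens: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Return per-role counts for every grunt label seen in more than one role."""
    by_grunt: Dict[str, Counter] = defaultdict(Counter)
    for grunt, role in tokens:
        by_grunt[grunt][role] += 1
    # Keep only the labels that were observed in at least two distinct roles.
    return {g: dict(roles) for g, roles in by_grunt.items() if len(roles) > 1}


# Toy data, invented for illustration only (not taken from the corpus).
toy_tokens = [
    ("uh", "filler"),
    ("uh", "disfluency"),
    ("uh", "filler"),
    ("uh-huh", "back-channel"),
]
print(multi_role_grunts(toy_tokens))
```

Any label returned by such a tabulation is one for which a function-first labeling scheme forces the labeler to choose a role before choosing a spelling.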
One concrete problem with these schemes 
is that they are not designed to allow pho- 
netically accurate representations of grunts 5. 
In particular, they make the task of the la- 
beler a rather strange one. Given a grunt, 
first he must examine the context to deter- 
mine whether it is a back-channel or a filler, 
then determine whether it sounds affirmative 
or negative, and only then can he consider 
what the actual sound is, and his options are 
limited to picking one of the labels in the func- 
tional/semantic category. The relation be- 
tween the letters of the label and the phonet- 
ics of the grunt becomes somewhat arbitrary. 
This would be more tolerable if there was a 
clear tendency for each grunt to occur in only 
one functional position, but this is not the 
case, as noted above. The use of the affirmative/negative
distinction as a primary classificatory feature is also open to
question.
In our corpus, only 1% of the grunts were neg- 
ative in meaning, and these were all in con- 
texts where a negative answer was expected 
or likely, so this distinction is a strange choice 
for a top-level dividing principle. Moreover, 
negative grunts are, in fact, characterized by 
two syllables with a sharp syllable boundary,
often a glottal stop, and/or a sharp down- 
step in pitch, and/or a lack of breathiness, 
but these features are reflected only tenuously 
in the spellings listed as possible for negative 
grunts in these schemes. 
2.3 Naive Transcription 
The third tradition in transcribing grunts is to 
allow labelers to just spell them in the 'usual' 
way, as one might see them written in the 
comics or in a detective novel. The inspiration 
behind this is that native speakers generally 
have had a lot of exposure to orthographic 
representations of grunts, and can be trusted 
to do the right thing. 
One problem with this tradition is that the 
mapping from letter sequences to the actual 
sounds is not clear. For example, a conversa- 
tion transcription given as a textbook exam- 
ple of good practice includes "u" and "uh", 
and "oh" and "oo" (Hutchby and Wooffitt, 
1999), without footnoting. Presumably the "oo" means /u/, but it
could also possibly mean a version of "oh" with strong lip
rounding, or a longer form of "oh", or perhaps a shorter form
(if the labeler was trying to avoid confusion with the archaic
vocative "o"). English orthography is phonetically ambiguous
and not standardized for grunts.
A second problem with this tradition is that 
creaky voice (vocal fry), although pragmati- 
cally significant, is generally not represented 
(although many practitioners are surprisingly 
diligent at noting occurrences of breathiness). 
2.4 Summary of Desiderata 
Ideally we want a scheme for transcribing 
grunts which 
1. is easy to learn and use,
5 This is acceptable if the only aim is to train speech
recognizers, where the speech recognizers' acoustic models will
end up capturing the possible phonetic variation without human
intervention, and if the speech recognition results are not
intended for actual use, but merely to be fed into an algorithm
for computing recognition scores.
total back-channel filler disfluency
[clear-throat] 2 1
tsk 22 . 12 2 
ah 7 1 3 3 
aum 5 4 1 
hh 3 
hmm 2 
huh 2 
m-hm 2 :i 
mm 2 2
mmm 3 ',.) 
myeah 2 2 
nn-hn 4 4 
oh 20 6 
oh-okay 2 1 
okay 8 2 2 
u-uh 4 2 
uh 38 14 21 
uh-hn 2 
uh-huh 3 3 
uh-uh 2 1 1 
uhh 2 2
ukay 2 1 1
um 20 10 8
umm 5 5
uu 5 2 2 
uum 5 3 2 
yeah 71 27 19 1 
(other) 72 34 19 3 
Total 317 91 108 45 
isolate response confirmation
6 6 6 
8 3 
20 13 8 
final other 
1 
7 
1 1
5 
1 
1 1 
1 1 
1 
2 4 
1 4
6 26 
Table 1: Counts of Grunt Occurrences in various positions and functional roles, for all grunts 
occurring 2 or more times in our corpus 
[clear-throat] 2
tsk 23
tsk-naa 1
tsk-neeu 1
tsk-ooh 1
tsk-yeah 1
[inhale] 1
[unsticking] 4
aa 
achh 1 
ah 7 
ahh 1 
ai 1 
am 
BO 
aDO 
aum 
eah 
ehh 
h-Ylllrllq~ 
haah 
hh 
hh-ae~h 
hhh 
hhh-uuuh 
hhn 
hmm 
hmm'ml'nrn 
1 Im 
lm-lm 
huh 
i 
iiyeah 
1 m-hm 
1 mm 
I ~m-hm 
1fflffn-IYiYrt 
1 vn'rnrn 
1 myeah 
1 nn-hn 
nn-nnn 1
nu 1
nuuuuu 1
nyaa-haao 1
nyeah 1
o-w 1
oa 1
oh 20
oh-eh 1
oh-kay 1
oh-okay 2
oh-yeah 1
okay 8
okay-hh 1
ooa 1
ookay 1
oooh 1
ooooh 1
oop-ep-oop 1
u-kay 1
u-uh 4
u-uun 1
uam 1
uh 38
uh-hn 2
uh-hn-uh-hn 1
uh-huh 3
uh-~ 1
uh-uh 2
uh-uhmmm 1
uhh 2
uhhh 1
uhhm 1
ukay 2
um 21
um-hm-uh-hm 1
Rl-lr11'n
~----n,,Hn 1
au-lm 1
unkay 1
unununu 1
uu 5
uum 5
unmm 1
uun 1
uutth 1
uuuuuuu 1
wow 1
yah-yeah 1
ye 1
yeah 71
yeah-okay 1
yeah-yeah 1
yeahaah 1
yeahh 1
yegh 1
yeh-yeah 1
yei 1
yo 1
yyeah 1
Table 2: All Grunts in our Corpus, with numbers of occurrences
2. can represent all observed grunts, and 
3. unambiguously represents all meaningful 
differences in sound. 
While it is not possible to devise a single 
transcription scheme which is perfect for all 
purposes (Barry and Fourcin, 1992), it is clear 
that the current schemes all have room for 
improvement. 
3 Proposal 
The basic idea is to start with the naive tran- 
scription tradition and then tighten it up. 
The advantages of using this as a starting 
point are two. First, it's convenient, since 
it is ASCII, familiar, and requires no special 
training. Second, as the cumulative result of many years of
novelists' and cartoonists' efforts to represent dialog, it has
presumably evolved to be fairly adequate for capturing those
sound variations which are significant to meaning.
The biggest need is to clarify and regular- 
ize the mapping from transcription to sound. 
This is the primary contribution of this paper: a specification
of the actual phonetic values of each of the letters commonly
used in transcribing conversational grunts, as follows:
u means schwa. This causes no confusion because high vowels,
    including /u/, are vanishingly rare in conversational grunts.
n generally means nasalization. This is unfamiliar in that
    English, unlike French, has no nasalized vowels in the words
    of the lexicon. However in grunts nasalization is common, as
    in un-hn and nyeah, and meaning-bearing. Occasionally there
    may be nasal consonants, and n can also be used for such
    cases, without confusion, because they appear to bear the
    same semantic value.
h generally means breathiness. This often occurs at syllable
    boundaries, as in uh-huh. Some items involve breathiness
    throughout a syllable, others involve a consonantal /h/,
    while others seem ambiguous between these two.
    A single syllable-final 'h' bears no phonetic value.
tsk indicates an alveolar tongue click. These occur often in
    isolation, and occasionally grunt-initially 6.
- (hyphen) indicates a fairly strong syllable boundary.
    Phonetically this means a major dip in energy level, a sharp
    discontinuity in pitch, or a significant region of breathy
    or creaky voice.
[repetition] Repetition of a letter indicates length and/or
    multiple weakly-separated syllables.
uu as a syllable is a special case, indicating a creaky schwa.
All other letters have the normal values. 
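The letter values above can be read as a small rule table applied syllable by syllable. The following Python sketch shows one way to do that; it is an illustration under stated assumptions, not a reference implementation. The function names and the output wording are hypothetical, the spellings yeah and kay are treated as unanalyzed units as in Table 3, and the choice to treat only a lone final 'h' as silent is one reading of the rule.

```python
from itertools import groupby
from typing import List

# Letter values taken from the list above and from Table 3.
LETTER_VALUES = {
    "u": "schwa",
    "n": "nasalization",
    "h": "/h/ or breathiness",
    "m": "/m/",
    "o": "/o/",
    "a": "/a/",
    "y": "/j/",
}
# Idiosyncratic spellings, kept as whole units (assumption of this sketch).
SUFFIX_VALUES = {"yeah": "yeah (idiosyncratic spelling)", "kay": "/keI/"}
LONG = " (long or repeated)"


def syllable_sounds(syllable: str) -> List[str]:
    """Describe the sound components of one hyphen-delimited syllable."""
    if syllable == "tsk":
        return ["alveolar click"]
    if syllable == "uu":
        return ["creaky schwa"]
    tail: List[str] = []
    for suffix, value in SUFFIX_VALUES.items():
        if syllable.endswith(suffix):
            syllable, tail = syllable[: -len(suffix)], [value]
            break
    # Collapse repeated letters into (letter, run length) pairs.
    runs = [(letter, len(list(group))) for letter, group in groupby(syllable)]
    sounds: List[str] = []
    for i, (letter, count) in enumerate(runs):
        is_final = i == len(runs) - 1 and not tail
        if letter == "h" and count == 1 and is_final:
            continue  # a single syllable-final 'h' bears no phonetic value
        value = LETTER_VALUES.get(letter, f"{letter} (normal value)")
        sounds.append(value + (LONG if count > 1 else ""))
    return sounds + tail


def grunt_sounds(base: str) -> List[List[str]]:
    """Split a transcription at hyphens and describe each syllable."""
    return [syllable_sounds(s) for s in base.split("-")]


print(grunt_sounds("uh-huh"))
print(grunt_sounds("nn-hn"))
```

Under these assumptions uh-huh comes out as a schwa followed by a breathy schwa, and nn-hn as a lengthened nasal followed by a breathy nasal, which is the reading the letter definitions are meant to give.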
There are two things that standard English orthography provides
no way to express. These are expressed as annotations, following
the basic transcription and separated from it by a colon.
cr indicates creaky voice, as in yeah:cr. For further precision
    numbers from 1 to 3 can be postposed, as in :cr1 for slightly
    creaky and :cr3 for extremely creaky.
[numbers] numbers after a colon indicate anchor points for the
    pitch contour, on the standard 1 to 5 scale. Thus uh-uh:~-22
    is a negative response or warning, but uh-huh:43-22 is a
    blatantly uninterested back-channel, and uh-huh:32-34 is the
    standard, polite back-channel.
Table 3 summarizes these letter-sound 
mappings. Table 4 suggests which sounds are 
most common. 
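A label in this notation can be split mechanically into the basic transcription and its annotations. The sketch below is a minimal Python illustration, not part of the proposal itself. The class and function names are hypothetical, and two points are assumptions that the text does not settle: that several annotations may be chained with further colons, and that each hyphen-separated group of pitch digits belongs to one syllable.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Grunt:
    base: str                           # basic transcription, e.g. "uh-huh"
    syllables: List[str]                # base split at strong boundaries (hyphens)
    creaky: bool = False                # True if a :cr annotation is present
    creak_degree: Optional[int] = None  # 1 (slight) to 3 (extreme), if given
    pitch: List[List[int]] = field(default_factory=list)  # anchor points, 1 to 5


def parse_grunt(label: str) -> Grunt:
    """Parse a label such as 'uh-huh:32-34' or 'yeah:cr3'."""
    base, *annotations = label.strip().split(":")
    if not base:
        raise ValueError(f"empty transcription in {label!r}")
    grunt = Grunt(base=base, syllables=base.split("-"))
    for ann in annotations:
        creak = re.fullmatch(r"cr([1-3])?", ann)
        if creak:
            grunt.creaky = True
            grunt.creak_degree = int(creak.group(1)) if creak.group(1) else None
        elif re.fullmatch(r"[1-5]+(-[1-5]+)*", ann):
            # Each digit is one pitch anchor point on the 1 to 5 scale.
            grunt.pitch = [[int(d) for d in group] for group in ann.split("-")]
        else:
            raise ValueError(f"unrecognized annotation {ann!r} in {label!r}")
    return grunt


print(parse_grunt("uh-huh:32-34"))
print(parse_grunt("yeah:cr3"))
```

Keeping the annotations separate from the base spelling means that a tool which ignores pitch and creakiness can still group uh-huh:43-22 and uh-huh:32-34 under the same basic transcription.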
4 Adequacy 
This scheme does fairly well by the criteria of 
§2.4. 
6 There are cases where the click is followed by a
voiced sound without any perceptible pause (with a 
delay from the onset of the click to the onset of voicing 
of 50 to 170 milliseconds). 
notation                  phonetic value
non-trivial mappings
  h                       a single syllable-final 'h' bears no phonetic value,
                          elsewhere 'h' indicates /h/ or breathiness
  n                       nasalization, occasionally a nasal consonant
                          (other than /m/)
  tsk                     alveolar tongue click
  u                       ə (schwa)
  repetition of a letter  length and/or multiple weakly-separated syllables
  - (hyphen)              a fairly strong boundary between syllables or words
standard mappings common in grunts
  m                       /m/
  o                       /o/
  a                       /a/
  y                       /j/, as in yeah and variants
idiosyncratic spellings
  yeah                    /je~/
  kay                     /keI/, as in okay, ukay, unkay, mkay etc.
  uu                      as a syllable, indicates a short creaky or
                          glottalized schwa
annotations
  :cr                     creaky voice (vocal fry)
  :1~5                    pitch level
Table 3: Regularized English Orthography for Conversational Grunts
sound                 number
/m/                   56
nasalization          20
/h/ and breathiness   38
clicks                25
creaky voice          53
/schwa/               109
/o/                   35
/a/                   5
Table 4: Numbers of grunts in our corpus which include the
various sound components
1. As far as clarity and usability, this 
scheme has a direct and simple mapping from 
representation to the actual phonetics. It has 
been trivial to learn and easy to use (at least 
for the author; other labelers have not yet 
been trained). 
2. As far as representational coverage, this scheme is adequate
for some 97% (=306/317) of the grunts which occur in our corpus.
Thus it is not truly complete, and labelers must be allowed to
escape into standard lexical orthography (for things like
oop-ep-oop and wow), into IPA (for cases like achh and yegh,
palatal and velar fricatives, respectively), and into ad hoc
notation (for cases like throat clearings and noisy
exhalations).
3. As far as precision, the scheme allows sufficiently detailed
representation, at least to a first approximation. In
particular, it covers all known meaningful phonetic variations.
It is, however, possible that other phonetic distinctions are
also significant. For example,
it may be that the exact height of a vowel 
matters, or the exact time point at which a 
vowel starts getting creaky, or the presence 
of glottal stops, lip rounding, glottalization, 
falsetto, and so on matter, or the precise de- 
tails of pitch and energy contours matter. 
Conversely, the scheme is not over-precise: 
all the phonetic elements represented in the 
scheme appear to bear meanings (Ward, 
2000). 
Regarding unambiguity, the scheme is an
improvement but has one failing: repetition 
of a letter represents either extended duration 
or the presence of multiple syllables. As these 
two phonetic features are generally correlated, 
and the difference in meaning between them 
is anyway subtle, this may not be a major 
problem. 
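The coverage point (item 2 above) suggests a rough screening step: flag the labels that cannot be spelled with the letters given explicit values in the scheme, so that a labeler knows an escape notation is in play. The Python sketch below is a purely orthographic heuristic under stated assumptions; the 306/317 figure rests on the author's phonetic judgment and is not reproduced by this check. The names SCHEME_LETTERS, SPECIAL_CHUNKS and needs_escape are hypothetical.

```python
# Letters with explicit values in Table 3 (assumption: other letters,
# although they keep their "normal values", are flagged for review here).
SCHEME_LETTERS = set("unhmoay")
# Multi-letter units treated as wholes: idiosyncratic spellings and the click.
SPECIAL_CHUNKS = ("yeah", "kay", "tsk")


def needs_escape(base: str) -> bool:
    """Return True if a label uses characters outside the core scheme."""
    for syllable in base.split("-"):
        rest = syllable
        for chunk in SPECIAL_CHUNKS:
            rest = rest.replace(chunk, "")
        if not set(rest) <= SCHEME_LETTERS:
            return True
    return False


for label in ["uh-huh", "okay", "tsk-yeah", "wow", "achh", "[clear-throat]"]:
    print(label, needs_escape(label))
```

With this rule the examples named in the text (oop-ep-oop, wow, achh, yegh, and bracketed items such as throat clearings) are flagged, while labels such as uh-huh, okay and myeah pass.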
5 Open Issues 
This notation assumes that the component 
sounds are categorical (except for creakiness 
and pitch), but this may in fact not be the 
case. Rather it may be that the phonetic 
components of grunts have a "gradual, rather 
than binary, oppositional character" (Jakobson and Waugh,
1979). This is a problem
especially for nasalization and for vowels: it 
may be that there is an infinite number of 
slightly but significantly different variations. 
Further study is required. 
Experiments with multiple independent la- 
belers are needed to evaluate usability and 
measure cross-labeler agreement. 
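One common way to quantify cross-labeler agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of kappa is an assumption of this sketch, since no agreement measure is specified here; the function name is hypothetical, and the sketch further assumes that both labelers have labeled the same pre-segmented tokens and that only exact string matches count as agreement.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two labelers over the same sequence of tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of tokens given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each labeler's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
labeler_1 = ["uh-huh", "um", "uh", "yeah"]
labeler_2 = ["uh-huh", "um", "um", "yeah"]
print(round(cohen_kappa(labeler_1, labeler_2), 3))
```

Because the label set is open-ended, exact-match agreement is a strict criterion; a softer comparison (for example, ignoring annotations after the colon) would be a separate design decision.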
Applying this notation can be complicated by dialect and
individual differences. For example, the primary filler for one
speaker in our corpus was aum. Right now it is not known
whether this is a mere pronunciation variation, perhaps
dialect-related, or significantly different from um. More study
is needed.
Other languages also have conversational grunts, for example,
ouais and hein in French, ja and hm in German, and un, he and ya
in Japanese (Ward, 1998), and it may be possible to use or adapt
the present scheme for these and other languages.

References 
W. J. Barry and A. J. Fourcin. 1992. Levels of labeling.
Computer Speech and Language, pages 1-14.
J. Hamaker, Y. Zeng, and J. Picone. 1998. Rules 
and guidelines for transcription and segmenta- 
tion of the switchboard large vocabulary con- 
versational speech recognition corpus, version 
7.1. Technical report, Institute for Signal and 
Information Processing, Mississippi State Uni- 
versity. 
Ian Hutchby and Robin Wooffitt. 1999. Conversation Analysis.
Blackwell.
Roman Jakobson and Linda Waugh. 1979. The 
Sound Shape of Language. Indiana University 
Press. 
T. Lander. 1996. The CSLU labeling guide. Technical Report
CSLU-014-96, Center for Spoken Language Understanding, Oregon
Graduate Institute of Science and Technology.
Fernando Poyatos. 1975. Cross-cultural study of paralinguistic
"alternants" in face-to-face interaction. In Adam Kendon,
Richard M. Harris, and Mary R. Key, editors, Organization of
Behavior in Face-to-Face Interaction, pages 285-314. Mouton.
George L. Trager. 1958. Paralanguage: A first approximation.
Studies in Linguistics, pages 1-12.
Nigel Ward. 1998. The relationship between
sound and meaning in Japanese back-channel
grunts. In Proceedings of the ~th Annual Meet- 
ing of the (Japanese) Association for Natural 
Language Processing, pages 464-467. 
Nigel Ward. 2000. The challenge of non-lexical 
speech sounds. In International Conference on 
Spoken Language Processing. to appear. 
