AUTOMATED TONE TRANSCRIPTION 
Steven Bird 
University of Edinburgh, Centre for Cognitive Science 
2 Buccleuch Place, Edinburgh, EH8 9LW, UK 
Internet: Steven.Bird@ed.ac.uk
Abstract 
In this paper I report on an investigation into the
problem of assigning tones to pitch contours. The
proposed model is intended to serve as a tool for
phonologists working on instrumentally obtained
pitch data from tone languages. Motivation and
exemplification for the model is provided by data
taken from my fieldwork on Bamileke Dschang
(Cameroon). Following recent work by Liberman
and others, I provide a parametrised F0 prediction
function $\mathcal{P}$ which generates F0 values from a tone
sequence, and I explore the asymptotic behaviour
of downstep. Next, I observe that transcribing a
sequence X of pitch (i.e. F0) values amounts to
finding a tone sequence T such that $\mathcal{P}(T) \approx X$. This
is a combinatorial optimisation problem, for which
two non-deterministic search techniques are
provided: a genetic algorithm and a simulated
annealing algorithm. Finally, two implementations,
one for each technique, are described and then
compared using both artificial and real data for
sequences of up to 20 tones. These programs can
be adapted to other tone languages by adjusting
the F0 prediction function.
INTRODUCTION 
The wealth of literature on tone and intonation
has amply demonstrated that voice pitch (F0) in
speech is under independent linguistic control. In
English, voice pitch alone can signal the
distinction between a statement and a question.
Similarly, in many tone languages, voice pitch alone
signals the tense of a verb. Phonologists usually
describe a pitch contour much as they describe
speech more generally, namely as a sequence of
discrete units (i.e. a transcription). This is
illustrated in Figure 1, where L indicates a low tone
and ↓H indicates a downstepped high tone. The
question addressed in this paper concerns how we
should relate pitch contours to tone sequences.
This paper is divided into four main sections,
summarised in turn below.
Tone Transcription In this section I present the
problem of relating sequences of F0 values to
tone transcriptions. I argue that Hidden Markov
Models are unsuited to the task and I
demonstrate the importance of having a computational
tool which allows phonologists to experiment
with F0 scaling parameters.
F0 Scaling This section gives a mathematical
basis for a general approach to F0 scaling which,
it is hoped, will be applicable to any tone
language. I derive an F0 prediction function from
first principles and show how the model of
Liberman et al. (1993) for the Nigerian language
Igbo is a special case.
Tone and F0 in Bamileke Dschang Here I
present some data from my own fieldwork and 
give a statistical analysis, using the same tech- 
nique used by Liberman et al. I then show how 
the general model of the previous section is in- 
stantiated for this language. This demonstrates 
the versatility of the general model, since it can 
be applied to two very different tone languages. 
Implementations This section provides two
non-deterministic techniques for transcribing an 
F0 string. The first method uses a genetic algo- 
rithm while the second method uses simulated 
annealing. The performance of both implemen- 
tations is evaluated and compared on a range 
of artificial and real data. Finally, I give some 
examples of multiple, automatically-generated 
transcriptions of the same F0 data. 
TONE TRANSCRIPTION 
Generation and Recognition 
A promising way of generating contours from tone
sequences is to specify one or more pitch targets
per tone and then to interpolate between the
targets; the task then becomes one of providing
a suitable sequence of targets (Pierrehumbert &
Beckman, 1988). It is perhaps less clear how we
should go about recognising tone sequences from
pitch contours. Hidden Markov Models (HMMs)
(Huang et al., 1990) offer a powerful statistical
approach to this problem, though it is unclear how
they could be used to recognise the units of
interest to phonologists. HMMs do not encode
timing information in a way that would allow them
to output, say, one tone per syllable (or vowel).
Moreover, the same section of a pitch contour may
correspond to either H or L tones. For example,
an H between two Hs looks just like an L between
two Ls. There is no principled upper bound on
the amount of context that needs to be inspected
in order to resolve the ambiguity, leading to
a multiplication of state information required by
the HMM and problems for training it.
Figure 1: F0 Trace for Bamileke Dschang Utterance: 'child and child and ...' (vertical axis: F0 in Hz; tone labels alternate ↓H and L)
In the present context, the emphasis is not 
on automatic speech recognition but on a tool to 
support phonologists working with tone. As we 
shall see in the next section, once the phonologist 
has identified the salient location to measure the 
'F0 value' of a syllable (or some other phonologi- 
cal unit), the task will be to automatically map a 
string of these values to a string of tones. 
A Tool for Phonologists 
Connell and Ladd have devised a set of heuristics
for identifying key points in an F0 contour at which
to record F0 values (Connell & Ladd, 1990, 21ff). In
the absence of a program which enshrines these
heuristics, it was decided to develop a system for
producing a tone transcription from a sequence of
F0 values. Apart from the obvious benefits of
automating the process, such as speed and accuracy,
it would show up cases where there is more than
one possible tone transcription, possibly with
different parameter settings for the F0 scaling
function. Having the set of tone transcriptions that
are compatible with an utterance has considerable
value to an analyst searching for invariances in the
tonal assignments to individual morphemes.
To exemplify this point, it is worth considering
a recent example where an alternative transcription
of some data proved valuable in providing a
fresh analysis of the data. In their analyses of tone
in Bamileke Dschang, Hyman gives the transcription
in (1a) while Stewart gives the one in (1b),
for the phrase meaning 'machete of dogs'.
(1) a. flJai mSmSbhd -- (Hyman, 1985, 50) 
b. J~Jai't' SmSmbh6- (Stewart, 1993, 2(10) 
These two possibilities exist because of different F0
scaling parameters. These parameters determine
the way in which the different tones are scaled
relative to each other and to the speaker's pitch
range. This is illustrated in (2), adapting Hyman's
earlier notation (Hyman, 1979).
(2) a. Hyman: flJli m~m,l.bhti
       .fl pl f mo $ mbhfi
       L  L  H  L  ↓  H
       3  3  1  3     1
       0  0  0  0  1  1
       3  3  1  3     2
    b. Stewart: ~lpi't" SmSmbh4
       pl J" f $ mb mbhfi
       L  L  ↑  H  ↓  L  H
       2  2     1     2  1
       1  1  0  0  1  1  1
       3  3     1     3  2
Example (2) displays a kind of phonetic
interpretation function. Immediately below the two
rows of tones we see a row of numbers corresponding
to the tones. For Hyman, L=3 and H=1,
while for Stewart, L=2 and H=1. Observe in
Hyman's example that a rising tone, symbolised by a
wedge above the vowel, is modelled as an LH sequence
in keeping with standard practice in African tone
analysis.
The second row of numbers corresponds to
downstep (↓) and upstep (↑). For Hyman's model,
this row begins at 0 and is increased by 1 for each
downstep encountered. For Stewart's model, this
row begins at 1 and is increased by 1 for each
downstep encountered and decreased by 1 for each
upstep encountered. The two rows are summed
vertically to give the last row of numbers.
Observe that the last rows of Stewart's and Hyman's
models are identical.
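The row arithmetic of (2) can be sketched in a few lines of Python. This is only an illustration: the function name `interpret` and the spellings `"down"` and `"up"` for the downstep and upstep symbols are assumptions made here, not part of either author's notation.

```python
def interpret(symbols, tone_value, start):
    """Return the summed (last) row of (2) for a string of tone symbols.

    symbols    : sequence of "L", "H", "down" (downstep) or "up" (upstep)
    tone_value : number assigned to each tone, e.g. {"L": 3, "H": 1}
    start      : initial value of the downstep/upstep row
    """
    register = start
    summed = []
    for sym in symbols:
        if sym == "down":
            register += 1          # each downstep raises the count by 1
        elif sym == "up":
            register -= 1          # each upstep lowers it by 1
        else:
            summed.append(tone_value[sym] + register)
    return summed


# The two transcriptions in (2), with each author's own settings.
hyman = interpret(["L", "L", "H", "L", "down", "H"], {"L": 3, "H": 1}, start=0)
stewart = interpret(["L", "L", "up", "H", "down", "L", "H"], {"L": 2, "H": 1}, start=1)
```

Both calls yield the same number sequence, which is the sense in which the two transcriptions describe the same data.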
The parameter which distinguishes the two
approaches is partial vs. total downstep. Hyman
treats Dschang as a partial downstep language,
i.e. where ↓H appears as a mid tone (with respect
to the material to its left). Stewart treats it as a
total downstep language, i.e. where ↓H appears as
an L tone (with respect to the material to its left).
While Hyman and Stewart present rather
different analyses of rather different looking
transcriptions, we can see that they are really
analyzing the same data, given the above interpretation
function. Therefore, phonologists who do not wish
to limit themselves to the transcriptions which
result from certain parameter settings in the
phonetic interpretation function would be better off
working directly with number sequences like the
last row in (2). This paper describes a tool which
lets them do just that.
F0 SCALING
Consider again the F0 contour in Figure 1. In
particular, note that the F0 decay seems to be to
a non-zero asymptote, and that H and L appear to
have different asymptotes which we symbolise as h
and l respectively. These observations are clearer
in Figure 2, which (roughly speaking) displays the
peaks and valleys from Figure 1.
Although this is admittedly a rather artificial
example, it remains true that there is no principled
upper limit on the number of downsteps that
can occur in an utterance (Clements, 1979, 540),
and so the asymptotic behaviour of F0 scaling still
needs to be addressed.
Figure 2: Asymptotic Behaviour of F0 (vertical axis: F0 in Hz)
Now suppose that we have a sequence T of
tones where $t_i$ is the ith tone (H or L) and a
sequence X of F0 values where $x_i$ is the F0 value
corresponding to $t_i$. Then we would like a formula
which predicts $x_i$ given $x_{i-1}$, $t_i$ and $t_{i-1}$ ($i > 1$).
We express this as follows:
$$x_i = \mathcal{P}_{t_{i-1}t_i}(x_{i-1})$$
The question, now, is what should this function
look like? Suppose for sake of argument that the
ratio of L to the immediately preceding H in
Figure 2 is constant, with respect to the baselines
for H and L, namely h and l. Then we have:
$$\frac{x_i - l}{x_{i-1} - h} = c$$
More generally, suppose that we have a sequence
of two arbitrary tones. Ignoring the possibility of
downstep for the present, we have a static
two-tone system where HH and LL sequences are level
and sequences like HLHLHL are realised as simple
oscillation between two pitches. We can write the
following formula, where $\bar{t}_i = h$ if $t_i$ = H and
$\bar{t}_i = l$ if $t_i$ = L.
$$\frac{x_i - \bar{t}_i}{x_{i-1} - \bar{t}_{i-1}} = \frac{\bar{t}_i}{\bar{t}_{i-1}}, \qquad x_i = \frac{\bar{t}_i}{\bar{t}_{i-1}}\, x_{i-1}$$
The situation becomes more interesting when we
allow for downdrift and downstep. Downdrift is
the automatic lowering of the second of two H
tones when an L intervenes, so HLH is realised as
[¯ _ -] rather than as [¯ _ ¯], while downstep is the
lowering of the second of two tones when an
intervening L is lost, so H↓H is realised as [¯ -] (Hyman
& Schuh, 1974). Bamileke Dschang has downstep
but not downdrift while Igbo has downdrift but
only very limited downstep. Now we define $\bar{t}_i = h$
if $t_i$ = H, ↓H and $\bar{t}_i = l$ if $t_i$ = L, ↓L. Generalising
our equation once more, we have the following,
where R is a factor called the transition ratio.
$$\frac{x_i - \bar{t}_i}{x_{i-1} - \bar{t}_{i-1}} = R_{t_{i-1}t_i}\,\frac{\bar{t}_i}{\bar{t}_{i-1}}$$
$$x_i = \mathcal{P}_{t_{i-1}t_i}(x_{i-1}) = R_{t_{i-1}t_i}\,\frac{\bar{t}_i}{\bar{t}_{i-1}}\, x_{i-1} + \bar{t}_i\,(1 - R_{t_{i-1}t_i})$$
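As a minimal sketch, the general prediction function can be written directly from the last equation. The function names, the ASCII tone spellings (a tone counts as H-type if its label ends in "H") and the numerical settings below are assumptions made for illustration only.

```python
def baseline(tone, h, l):
    """Return the asymptote for a tone: h for H-type tones, l for L-type."""
    return h if tone.endswith("H") else l


def predict(prev_tone, tone, x_prev, R, h, l):
    """Predict x_i from x_{i-1} for the transition prev_tone -> tone.

    R maps (prev_tone, tone) pairs to transition ratios.
    """
    r = R[(prev_tone, tone)]
    t_prev = baseline(prev_tone, h, l)
    t_cur = baseline(tone, h, l)
    return r * (t_cur / t_prev) * x_prev + t_cur * (1.0 - r)


# Hypothetical settings, chosen only to exercise the formula.
R = {("H", "L"): 0.72, ("L", "H"): 1.0 / 0.72}
h, l = 96.0, 88.0
x_low = predict("H", "L", 120.0, R, h, l)    # an H at 120 Hz followed by L
x_high = predict("L", "H", x_low, R, h, l)   # and back up to H
```

One property worth noting in this form: whenever the previous value sits exactly on its own baseline, the predicted value sits exactly on the baseline of the next tone, whatever the transition ratio.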
Now I shall show how this general equation relates
to the equations for Igbo (Liberman et al., 1993,
151), reproduced below:
(3) HH     $x_i = x_{i-1}$
    HL     $x_i = (Fl/h)\,x_{i-1} + l(1 - F)$
    LH     $x_i = (h/l)\,x_{i-1}$
    LL     $x_i = F x_{i-1} + l(1 - F)$
    H↓H    $x_i = D x_{i-1} + h(1 - D)$
$\mathcal{P}$ can be instantiated to the set of equations in
(3) by setting R as follows, where $0 < F < 1$ and $0 < D < 1$:
                 $t_i$
    $t_{i-1}$    H    L    ↓H
    H, ↓H        1    F    D
    L            1    F
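The instantiation can be checked numerically. The sketch below compares the general formula against the five equations in (3), written out directly. The values chosen for F, D, h and l are illustrative assumptions and are not taken from Liberman et al. (1993); `"!H"` is an assumed ASCII spelling of ↓H.

```python
def predict(r, t_prev, t_cur, x_prev):
    """General form: x_i = R * (t_i / t_{i-1}) * x_{i-1} + t_i * (1 - R)."""
    return r * (t_cur / t_prev) * x_prev + t_cur * (1.0 - r)


# Illustrative values only; not taken from Liberman et al. (1993).
F, D, h, l = 0.8, 0.7, 100.0, 80.0

# The five equations in (3), written out directly.
igbo = {
    ("H", "H"): lambda x: x,
    ("H", "L"): lambda x: (F * l / h) * x + l * (1 - F),
    ("L", "H"): lambda x: (h / l) * x,
    ("L", "L"): lambda x: F * x + l * (1 - F),
    ("H", "!H"): lambda x: D * x + h * (1 - D),
}

# Transition ratios read off from (3), and the baseline of each tone.
R = {("H", "H"): 1.0, ("H", "L"): F, ("L", "H"): 1.0, ("L", "L"): F, ("H", "!H"): D}
base = {"H": h, "L": l, "!H": h}


def max_gap(x):
    """Largest disagreement between the general formula and (3) at value x."""
    return max(
        abs(predict(R[pair], base[pair[0]], base[pair[1]], x) - igbo[pair](x))
        for pair in R
    )
```

For any starting value the gap is zero up to rounding, which is what it means for (3) to be a special case.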
It will be helpful to introduce one more level
of generality. $\mathcal{P}$ relates adjacent F0 values, but
we would also like to relate non-adjacent values,
given the sequence of intervening tones. Suppose
that $T = t_0 \cdots t_n$ is a tone sequence where the F0
value of $t_0$ is $x$. Then we shall write the F0 value
of $t_n$ as $\mathcal{P}_T(x)$. By repeated applications of $\mathcal{P}$ we
can write down the following expression for $\mathcal{P}_T$:
$$\mathcal{P}_T(x) = \frac{\bar{t}_n}{\bar{t}_0}\, R_T\, x + \bar{t}_n\,(1 - R_T)$$
where $R_T = \prod_{k=1}^{n} R_{t_{k-1}t_k}$, $n \geq 2$. Now, suppose
that $S = s_0 \cdots s_m$ and $T = t_0 \cdots t_n$ are tone
sequences and that $s_0 = t_0$, $s_m = t_n$ and $\mathcal{P}_S = \mathcal{P}_T$.
Then it is straightforward to show that $R_S = R_T$.
Notice also that if $\mathcal{P}_T(x) = x$ for all $x$ and if
$\bar{t}_0 = \bar{t}_n$ then $R_T = 1$. These results will be useful
in the next section.
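The closed form for $\mathcal{P}_T$ can be compared with step-by-step application of the pairwise formula. The following sketch does so for one tone sequence. The ratios and baselines are hypothetical values chosen for the check, and the function names are assumptions.

```python
def predict(r, t_prev, t_cur, x_prev):
    """Pairwise prediction: R * (t_i / t_{i-1}) * x_{i-1} + t_i * (1 - R)."""
    return r * (t_cur / t_prev) * x_prev + t_cur * (1.0 - r)


def stepwise(tones, x0, R, base):
    """Apply the pairwise prediction along the sequence t_0 ... t_n."""
    x = x0
    for prev, cur in zip(tones, tones[1:]):
        x = predict(R[(prev, cur)], base[prev], base[cur], x)
    return x


def closed_form(tones, x0, R, base):
    """P_T(x) = (t_n / t_0) * R_T * x + t_n * (1 - R_T)."""
    r_total = 1.0
    for prev, cur in zip(tones, tones[1:]):
        r_total *= R[(prev, cur)]        # R_T is the product of the pairwise ratios
    first, last = base[tones[0]], base[tones[-1]]
    return (last / first) * r_total * x0 + last * (1.0 - r_total)


# Hypothetical ratios and baselines, for illustration only.
R = {("H", "L"): 0.7, ("L", "L"): 0.9, ("L", "H"): 1.2, ("H", "H"): 1.0}
base = {"H": 96.0, "L": 88.0}
T = ["H", "L", "L", "H", "L"]
```

Calling `stepwise(T, 130.0, R, base)` and `closed_form(T, 130.0, R, base)` gives the same value up to rounding, since the intermediate baselines cancel in the product.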
Finally, it is worth comparing $\mathcal{P}$ with Hyman's
and Stewart's interpretation functions which were 
illustrated in (2). As pointed out already, Hy- 
man's is a partial downstep model while Stewart's 
is a total downstep model. Partial and total down- 
step can be visualised as follows, where the dotted 
lines indicate the abstract register inside which to- 
nes are scaled, and where downstep corresponds to 
lowering of the register. 
[Diagram panels: Partial downstep | Total downstep]
Observe that for partial downstep, it is necessary
to have two downsteps before a high tone is at
the level of a preceding low, while for total
downstep, it is only necessary to have a single
downstep for a high tone to be at the same level as
the preceding low. We can express these
observations about partial and total downstep in the
model as follows. For partial downstep, we have
$\mathcal{P}_{L{\downarrow}H{\downarrow}H}(x) = x$ while for total downstep we have
$\mathcal{P}_{L{\downarrow}H}(x) = x$. Since $\mathcal{P}_T(x) = (\bar{t}_n/\bar{t}_0) R_T\, x + \bar{t}_n(1 - R_T)$,
either identity can hold for all $x$ only if $R_T = 1$ and
$\bar{t}_n = \bar{t}_0$. For both of these equations we are therefore
forced to have $h = l$, which does not seem to be
empirically justifiable in view of the data in Figure 1.
It might be argued that this indicates a flaw in the
model being presented here, since partial and total
downstep are widely attested in the literature on
tone languages. Unfortunately, it is not possible
in general to provide a model for partial or total
downstep which permits distinct asymptotes for H
and L.¹ Therefore, to the extent that Figure 1 is
typical of tone languages in having different H and
L asymptotes, one must conclude that total and
partial downstep are qualitative terms only.
However, they may yet re-emerge in the model under
a different guise, as we shall see later.
The effect of the distinction between partial
and total downstep is to allow different
transcriptions of the same string, as we saw in (2). In
general, we have the following mapping between
transcriptions under the two views of downstep:
(4) partial   total        partial   total
    HH    ↔   HH           L↓H   ↔   LH
    HL    ↔   H↓L          L↓L   ↔   L↓L
    LH    ↔   L↑H          H↑H   ↔   H↑H
    LL    ↔   LL           H↑L   ↔   HL
    H↓H   ↔   H↓H          L↑L   ↔   L↑L
It is clear that changing from one view of
downstep to the other amounts to adding and deleting ↓
and ↑ while leaving the tones themselves
unchanged. Thus, the model admits both transcription
schemes that result from the two views of
downstep, and another besides, as shown later in (7).
This concludes the discussion of the F0
prediction function. In the next section I shall
investigate the phonetic interpretation of tone in
Bamileke Dschang, and determine the values of R for
this language.
¹To see why this is so for the case of total
downstep, suppose that such a model did exist, and so $l < h$.
Let $x \in [l, h)$, a valid F0 value for a low tone. Now,
whatever interpretation function $\mathcal{P}'$ we use, we still
require that $\mathcal{P}'_{L{\downarrow}H}(x) = x$ by definition of total
downstep, which means that there is now a high tone with
an F0 value less than h. But h is the asymptote below
which no high tones should ever be realised, and so we
have a contradiction. The case for partial downstep
follows similarly.
TONE AND F0 IN
BAMILEKE DSCHANG
In a recent field trip to Western Cameroon to
study the Bamileke Dschang² noun associative
construction, I was able to collect a small amount
of data relating to F0 scaling throughout a
particular informant's pitch range. Following
Liberman et al., voice pitch was varied by getting the
informant to speak at different volumes and by
adjusting the recording level appropriately.
However, rather than asking the informant to
imagine speaking to a subject at different distances,
I controlled the volume by having the informant
wear headphones and playing white noise from a
detuned radio. Thus, I could set the informant's
voice pitch by using the volume control on my
radio. My hypothesis is that this technique produces
more consistent volume (and hence, pitch scaling)
over long utterances and may make informants less
self-conscious about speaking loudly than simply
asking them to imagine speaking to subjects at
various distances away. Measurements were taken
from the following data.
(5) HH d 3u5 sS1) t6 VI~U:5 t5 o t6 n3t~5 tdO t6 
nSu3 kd.p t¢~ nStt3 kip 
He saw the bird before he saw the hat before
he saw the b~r~k'et before he saw the pipe 
before he saw the cup 
LL ~tp/lk -- side, half 
L↓H, HL
~5 rob5 ~$s5 mb5 ... ~5 
jealousy and jealousy and ... jealousy 
15~p5 mb5 155p5 mb5 ... 15.l.pa 
breast and breast and ... breast 
mb.l.vt~t rob5 mbSvt~t rnb5 ... mbSv~t 
oil and oil and ... oil 
$m5 rob5 ,~m5 rob5 ... $m5 
child and child and ... child 
Regrettably, the LL data was only available
from isolated disyllables, and other sequences such
as LH and H↓H were not available at all. However,
from the F0 data for the above utterances we can
hypothesise the behaviour of these unseen
sequences, and this can be tested in subsequent empirical
work. The results for utterances involving HH and
LL sequences are displayed in Figure 3, while
results for L↓H and HL are displayed in Figure 4.
The regression equations obtained from these
data are displayed in (6), where the number of
occurrences of each tone sequence is given in
parentheses after the sequence. The third column gives
the standard error for the gradient and intercept.
Figure 3: Plot of $x_{i-1}$ vs $x_i$ for HH, LL (both axes in Hz; ○ = HH, ● = LL)
Figure 4: Plot of $x_{i-1}$ vs $x_i$ for L↓H, HL (both axes in Hz)
²Bamileke Dschang is a Grassfields Bantu language
spoken in the Western Province of Cameroon. The
name 'Bamileke' (pron: [ba'mileke]) represents both
an ethnic grouping and a language cluster; Dschang
(pron: [tʃaŋ]) is an important town around which one
of the Bamileke languages is spoken. The data here is
from the Bafou dialect.
(6) Tone Sequence   Regression Equation             Standard Error
    HH (119)        $x_i = 0.99x_{i-1} + 0.91$      0.012, 5.0
    LL (11)         $x_i = 1.02x_{i-1} - 1.39$      0.057, 3.6
    HL (40)         $x_i = 0.65x_{i-1} + 25.0$      0.015, 3.1
    L↓H (38)        $x_i = 1.10x_{i-1} + 0.54$      0.026, 4.3
From this, we conclude that HL is the only
sequence with an intercept significantly different
from zero, and that $x_i = x_{i-1}$ for HH and LL
sequences. We also conclude that $R_{HH} = R_{LL} =
R_{L{\downarrow}H} = 1$ (with $h/l = 1.1$) and $R_{HL} = 0.72$. This last
value will be referred to as the quantity d. We
also see that l = 88 Hz and h = 96 Hz. Fortunately,
these figures are sufficient to determine the
R values for all other pairs of tones in Bamileke
Dschang.
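The arithmetic behind these figures can be retraced from the regression coefficients in (6). The sketch below is one way of doing so, assuming that the L↓H intercept is treated as zero and that the HL intercept is read as $l(1 - R_{HL})$; the variable names are my own. It gives values close to, though not identical with, the rounded figures quoted above.

```python
# Regression coefficients from (6).
slope_HL, intercept_HL = 0.65, 25.0   # HL:  x_i = 0.65 x_{i-1} + 25.0
slope_LdH = 1.10                      # L↓H: x_i = 1.10 x_{i-1} (intercept taken as zero)

# With R for L↓H equal to 1, the L↓H gradient is simply h/l.
h_over_l = slope_LdH

# The HL gradient is R_HL * (l/h), so R_HL = gradient * (h/l).
R_HL = slope_HL * h_over_l

# The HL intercept is l * (1 - R_HL), which fixes l, and then h.
l = intercept_HL / (1.0 - R_HL)
h = h_over_l * l
```

Under these assumptions R_HL comes out at about 0.715, l at about 87.7 Hz and h at about 96.5 Hz.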
A further observation is that Bamileke
Dschang does not have downdrift, and so there
is no F0 difference across HLH and LHL
sequences. This is evident in Figure 5. Therefore, we
can write $\mathcal{P}_{HLH}(x) = x$, and by a result we
showed above, $R_{HL} R_{LH} = 1$. Given that $R_{HL} = d$ it
follows that $R_{LH} = d^{-1}$.
Concerning downstep, I shall assume that the
magnitude of downstep is independent of the tones
on either side, and so $\mathcal{P}_{HL{\downarrow}H} = \mathcal{P}_{H{\downarrow}H} = \mathcal{P}_{L{\downarrow}L} =
\mathcal{P}_{LH{\downarrow}L}$. A separate instrumental study supports
this hypothesis (Bird & Stegen, 1993). Therefore,
we have $R_{s{\downarrow}t} = dR_{st}$, where s is any tone
and t is H or L.
Finally, it is important to briefly consider
upstep, since it has been used in some analyses of
Bamileke Dschang (e.g. Stewart's). Given that
upstep and downstep are intended as inverses of each
other, we have the identities $\mathcal{P}_{s{\downarrow}t{\uparrow}t} = \mathcal{P}_{st} = \mathcal{P}_{s{\uparrow}t{\downarrow}t}$,
with s, t as before. We now have a complete table
for R:
                 $t_i$
    $t_{i-1}$    H          L     ↓H    ↓L      ↑H         ↑L
    H, ↓H, ↑H    1          d     d     $d^2$   $d^{-1}$   1
    L, ↓L, ↑L    $d^{-1}$   1     1     d       $d^{-2}$   $d^{-1}$
Observe the symmetries in this table. The
configuration of four R values that we find when $t_i$ is
not downstepped or upstepped (the first two
columns) is reproduced in the columns for downstep
(multiplied by d) and in the columns for upstep
(divided by d).
Note also that the above table is dependent
upon how the data in (5) was transcribed.
Suppose that we had not used repetitions of HL↓H
(a transcription scheme based on partial
downstep) but H↓LH (a scheme based on total
downstep). Then we would have had $R_{H{\downarrow}L} = d$ and
$R_{LH} = 1$. Accordingly, the table for R would be
as follows:
                 $t_i$
    $t_{i-1}$    H    L    ↓H    ↓L    ↑H         ↑L
    H, ↓H, ↑H    1    1    d     d     $d^{-1}$   $d^{-1}$
    L, ↓L, ↑L    1    1    d     d     $d^{-1}$   $d^{-1}$
The fact that we have two possible tables for
R is no cause for alarm. Recall that the transition
between two tones $t_{i-1}$ and $t_i$ also involves the
factor $\bar{t}_i/\bar{t}_{i-1}$. This factor is manifested in tone
transitions according to the following pattern:
                 $t_i$
    $t_{i-1}$    H      L      ↓H     ↓L     ↑H     ↑L
    H, ↓H, ↑H    1      l/h    1      l/h    1      l/h
    L, ↓L, ↑L    h/l    1      h/l    1      h/l    1
I therefore conclude that the presence of more
than one table for R indicates an interplay
between R values and the ratio h/l. This raises an
interesting question. Suppose we have two tone
sequences $T = t_0 \cdots t_n$ and $T' = t'_0 \cdots t'_n$, and two
interpretation functions $\mathcal{P}$ and $\mathcal{P}'$ based on R and
R′ respectively. Then under what circumstances is
the phonetic interpretation of both sequences the
same under their respective interpretation
functions? A sufficient condition for them to be the
same is that $\bar{t}_i = \bar{t}'_i$ and that $R_{t_{i-1}t_i} = R'_{t'_{i-1}t'_i}$.
The reader can check that these conditions are
met by the mapping in (4) and the two tables for
R given above. Note that this observation holds
for the model in general, not just for the
specialised version of the model as applied to Bamileke
Dschang.
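The check left to the reader can be carried out mechanically. The sketch below builds both tables for R from their unmarked 2×2 cores, using the symmetry that downstep multiplies an entry by d and upstep divides it by d, and then tests each pair in mapping (4). The ASCII spellings `"!"` for ↓ and `"^"` for ↑, the helper `make_table` and the value used for d are assumptions of this sketch.

```python
def make_table(core, d):
    """Expand a 2x2 core of R values to all six tones.

    The row depends only on whether the previous tone is H-type or L-type;
    a downstepped target multiplies the core value by d, an upstepped
    target divides it by d.
    """
    step = {"": 1.0, "!": d, "^": 1.0 / d}
    return {
        (prev_mark + s, mark + t): r * factor
        for prev_mark in step
        for (s, t), r in core.items()
        for mark, factor in step.items()
    }


d = 0.72  # illustrative; any value with 0 < d < 1 would serve
partial = make_table({("H", "H"): 1.0, ("H", "L"): d, ("L", "H"): 1.0 / d, ("L", "L"): 1.0}, d)
total = make_table({("H", "H"): 1.0, ("H", "L"): 1.0, ("L", "H"): 1.0, ("L", "L"): 1.0}, d)

# Mapping (4): pair under partial downstep -> pair under total downstep.
mapping = {
    ("H", "H"): ("H", "H"), ("H", "L"): ("H", "!L"), ("L", "H"): ("L", "^H"),
    ("L", "L"): ("L", "L"), ("H", "!H"): ("H", "!H"), ("L", "!H"): ("L", "H"),
    ("L", "!L"): ("L", "!L"), ("H", "^H"): ("H", "^H"), ("H", "^L"): ("H", "L"),
    ("L", "^L"): ("L", "^L"),
}

# Pairs whose R values differ between the two tables (expected: none).
mismatches = [p for p, q in mapping.items() if abs(partial[p] - total[q]) > 1e-12]

# Pairs whose tones differ in H/L class, i.e. in baseline (expected: none).
baseline_clashes = [
    p for p, q in mapping.items() if (p[0][-1], p[1][-1]) != (q[0][-1], q[1][-1])
]
```

Both lists come out empty, so each mapped pair has the same transition ratio and the same baselines under the two transcription schemes.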
It can also be shown that R is completely de- 
termined once RHL is specified. A possible charac- 
terisation of total vs. partial downstep now arises: 
if RHL = 1 then we have total downstep, but if 
RHL = d < 1 then we have partial downstep. 
However, the interpretation of these terms must 
necessarily be different from the standard inter- 
pretation, since I have shown that the standard 
interpretation is not compatible with the present 
model. 
This concludes the discussion of F0 scaling in 
Bamileke Dschang. I shall now present the imple- 
mentations. 
IMPLEMENTATIONS 
In this section, I show how it is possible to get
two programs to produce a sequence of tones T
(i.e. a tone transcription) given a sequence of n
F0 values X. The programs make crucial use of
the prediction function $\mathcal{P}$ in evaluating candidate
tone transcriptions.
Both programs involve search, and in general,
the aim in searching is to discover the values for
$x_1, \ldots, x_n$ so as to optimise the value of a specified
evaluation function $f(x_1, \ldots, x_n)$. When f has
many local optima, deterministic methods such as
hill-climbing perform poorly. This is because they
terminate in a local optimum and the particular
one found depends heavily on the starting point in
the search, and there is usually no way of choosing
a good starting point.
Exhaustive search for the global optimum is
not an option when the search space is
prohibitively large. In the present context, say for a
sequence of 20 tones, the search space contains
$6^{20} \approx 10^{15}$ possible tone transcriptions, and for
each of these there are thousands of possible
parameter settings, too large a search space for
exhaustive search in a reasonable amount of
computation time.
Figure 5: F0 Trace for 'bird and bird and ...' (vertical axis: F0 in Hz; tone labels alternate H and L)
Non-deterministic search methods have been
devised as a way of tackling large-scale
combinatorial optimisation problems, problems that involve
finding optima of functions of discrete variables.
These methods are only designed to yield an
approximate solution, but they do so in a
reasonable amount of computation time. The best known
such methods are genetic search (Goldberg, 1989)
and annealing search (van Laarhoven & Aarts,
1987). Recently, annealing search has been
successfully applied to the learning of phonological
constraints expressed as finite-state automata
(Ellison, 1993). In the following sections I describe a
genetic algorithm and an annealing algorithm for
the tone transcription problem.
A Genetic Algorithm 
For a cogent introduction to genetic search and an
explanation of why it works, the reader is referred
to (South et al., 1993). Before presenting the
version of the algorithm used in the implementation,
I shall informally define the key data types it uses
along with the standard operations on those types.
gene A linear encoding of a solution. In the
present setting, it is an array of n tones, where each
tone is one of H, ↓H, ↑H, L, ↓L or ↑L. A gene
also contains 16-bit encodings of the parameters
h, l and d. These encodings were scaled to be
floating point numbers in the range [90, 110] for
h, [70, 100] for l and [0.6, 0.9] for d.
gene pool An array of genes, P. One of the
search parameters is the size of P, known as the
population. The gene pool is renewed each
generation, and the number of generations is another
search parameter.
evaluation A measure of the fitness of a gene as
a solution to the problem. Suppose that X is
the sequence of F0 values we wish to transcribe.
Suppose also that T is a particular gene. Then
the evaluation function is as follows:
$$f_X(T) = \frac{1}{n} \sum_{i=2}^{n} \left(\mathcal{P}_{t_{i-1}t_i}(x_{i-1}) - x_i\right)^2$$
crossover This is an operation which takes two
genes and produces a single gene as the result.
Suppose that $A = a_1 \cdots a_n$ and $B = b_1 \cdots b_n$.
Then the crossover function $C_r$ is defined as
follows, where r is the (randomly selected)
crossover point ($0 < r < n$).
$$C_r(a_1 \cdots a_r a_{r+1} \cdots a_n,\; b_1 \cdots b_r b_{r+1} \cdots b_n) = a_1 \cdots a_r b_{r+1} \cdots b_n$$
In other words, the genes A and B are cut at
a position determined by r and the first part of
A is spliced with the second part of B to create
a new gene. Crossover builds in the idea that
good genes tend to produce good offspring. To
see why this is so, suppose that the transcription
contained in the first part of A is relatively
good while the rest is poor, while the transcription
contained in the first part of B is poor and
the rest is relatively good. Then the offspring
containing the first part of A and the second
part of B will be an improvement on both A
and B; other possible offspring from A and B
will be significantly worse and may not survive
to the next generation. The program performs
this kind of crossover for the parameters h, l
and d, employing independent crossover points
for each, and randomising the argument order
in $C_r$ so that the high order bits in the offspring
are equally likely to come from either parent.
An extension to crossover allows more than one
crossing point. The current model permits an
arbitrary number of crossing points for crossover
on the transcription string. The resulting gene
is optimal since we choose the crossing points in
such a way as to minimise $(\mathcal{P}_{t_{i-1}t_i}(x_{i-1}) - x_i)^2$
at each position. In developing the system,
exploiting the decomposability of the evaluation
function in this way caused a significant
improvement in system performance over the version
which used simple crossover.
breeding For each generation, we create a new
gene pool from the previous one. Each new gene
is created by mating the best of three randomly
chosen genes with the best of three other
randomly chosen genes.
mutation In order to maintain some genetic
diversity and an element of randomness
throughout the search (rather than just in the initial
configuration), a further operation is applied to
each gene in every generation. With a certain
probability (known as the mutation probability),
for each gene T and each tone in T, the tone
is randomly set to any of the six possible tones.
Likewise, the parameter encodings are mutated.
The mutation rate is set to 0.005 but raised to
0.5 for a single generation if the evaluation of the
best gene is no improvement on the evaluation
of the best gene ten generations earlier. The
best gene is never mutated.
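The operations just defined can be sketched compactly for the tone string alone. The sketch is only illustrative: it leaves out the 16-bit parameter encodings and the multi-point crossover extension, and the function names, the ASCII tone spellings (`"!"` for ↓, `"^"` for ↑) and the `rng` argument are assumptions.

```python
import random


def evaluate(tones, f0, R, h, l):
    """Mean squared gap between predicted and observed F0 values."""
    def base(tone):
        return h if tone.endswith("H") else l

    total = 0.0
    for i in range(1, len(tones)):
        r = R[(tones[i - 1], tones[i])]
        ratio = base(tones[i]) / base(tones[i - 1])
        predicted = r * ratio * f0[i - 1] + base(tones[i]) * (1.0 - r)
        total += (predicted - f0[i]) ** 2
    return total / len(tones)


def crossover(a, b, r=None, rng=random):
    """Single-point crossover: the first r tones of a, then the rest of b."""
    if len(a) != len(b):
        raise ValueError("parents must have the same length")
    if r is None:
        r = rng.randrange(1, len(a))  # crossover point with 0 < r < n
    return a[:r] + b[r:]


def mutate(tones, rate, rng=random):
    """Reset each tone to a random one of the six tones with probability rate."""
    inventory = ["H", "!H", "^H", "L", "!L", "^L"]
    return [rng.choice(inventory) if rng.random() < rate else t for t in tones]
```

A lower evaluation means a better fit, so `evaluate` returns 0 when every observed value matches its prediction from the preceding observed value.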
The building blocks of genetic search discussed
above are structured into the following
algorithm, expressed in pseudo-Pascal:
procedure genetic_search
begin
  initialise Pool, NewPool;
  for g := 1 to generations do
  begin
    { keep the mutation rate low while the best gene keeps improving }
    if good_performance(10) then
      mutation_rate := 0.005
    else
      mutation_rate := 0.5;
    { the best gene is copied unchanged into the new pool }
    NewPool[1] := find_best_gene(Pool);
    for n := 2 to population do
    begin
      gene1 := best_of_three(Pool);
      gene2 := best_of_three(Pool);
      NewPool[n] := crossover(gene1, gene2);
      mutate(NewPool[n], mutation_rate);
    end;
    Pool := NewPool;
    evaluate(Pool);
  end;
  write find_best_gene(Pool);
end
The main loop is executed for each generation.
Each time through this loop, the program checks
performance over the last ten generations and if
performance has been good, the mutation rate
stays low, otherwise it is changed to high. Then
it copies the best gene to the new pool. Now we
reach the inner loop, which selects two genes,
performs crossover, and mutates the result. Next, the
current pool is updated, an evaluation is
performed, and the program continues with the next
generation. Once all the generations have been
completed, the program displays the best gene from
the final population and terminates.
An Annealing Algorithm 
As with genetic algorithms, simulated annealing
(van Laarhoven & Aarts, 1987) is a combinatorial
optimisation technique based on an analogy with
a natural process. Annealing is the heating and
slow cooling of a solid which allows the formation
of regular crystalline structure having a minimum
of excess energy. In its early stages when the
temperature is high, annealing search resembles
random search. There is so much free energy in the
system that a transition to a higher energy state
is highly probable. As the temperature decreases
the search begins to resemble hill-climbing. Now
there is much less free energy and so transitions
to higher energy states are less and less likely. In
what follows, I explain some of the parameters of
annealing search as used in the current
implementation.
temperature At the start of the search the
temperature, t, is set to 1. During the search, the
temperature is reduced at a rate set by the 'cooling
rate' parameter, until it reaches a value less
than $10^{-6}$.
perturbation At each step of the search, the
current state is perturbed by an amount which
depends on the temperature. The temperature
determines the fraction of the search space that
is covered by a single perturbation step. For
a tone sequence of length n, we randomly reset
the worst $n \cdot t$ tones according to
$(\mathcal{P}_{t_{i-1}t_i}(x_{i-1}) - x_i)^2$. For the parameters we proceed
as follows, here exemplified for h. First, set
$p = t(h_{max} - h_{min})$. Now, add to h a random number
in the range $[-p, p]$ and check that the result is
still in the range $[h_{min}, h_{max}]$.
equilibrium At each temperature, the system is 
required to reach 'thermal equilibrium' before 
the temperature is lowered. In the present con- 
text, equilibrium is reached if no more than one 
of the last eight perturbations yielded a new 
state that was accepted. 
free energy function This is the amount of
available energy for transitions to higher energy
states. In the current system, it is the
distribution $-1000 \cdot t \cdot \log(p)$, where $p$
is a uniform random variable in the range $(0, 1]$.
If the energy difference $\Delta$ between an old and
a new state is less than the available energy, then
the transition is accepted. The factor of 1000 is
intended to scale the energy distribution to typical
values of the evaluation function.
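The parameter perturbation and the free energy test described above can be sketched in Python as follows. This is only an illustration, not the C++ code of the implementation: the function names are hypothetical, and redrawing when the perturbed value leaves the permitted range is an assumption, since the text only says that the result is checked.

```python
import math
import random


def perturb_parameter(value, lo, hi, t, rng=random):
    """Shift value by a random amount in [-p, p], with p = t * (hi - lo)."""
    if not lo <= value <= hi:
        raise ValueError("value must start inside [lo, hi]")
    p = t * (hi - lo)
    while True:
        candidate = value + rng.uniform(-p, p)
        # Assumption: an out-of-range result is simply redrawn.
        if lo <= candidate <= hi:
            return candidate


def available_energy(t, rng=random, scale=1000.0):
    """Draw -scale * t * log(p), with p uniform on (0, 1]."""
    p = 1.0 - rng.random()  # random() lies in [0, 1), so p lies in (0, 1]
    return -scale * t * math.log(p)


def accept(delta, t, rng=random):
    """Accept a transition whose energy difference is below the free energy."""
    return delta < available_energy(t, rng)
```

Because the available energy is never negative, any transition with a negative energy difference is accepted, while uphill transitions become rarer as t falls.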
Now the algorithm itself is presented: 
procedure annealing_search
begin
  initialise Trans, NewTrans, BestTrans;
  randomise Trans;
  t := 1;
  while t > 0.000001 do
  begin
    repeat
      NewTrans := perturb(Trans, t);
      Δ := evaluate(NewTrans)
           - evaluate(Trans);
      { accept improvements, and worse states if free energy allows }
      if Δ < 0 or
         exp(-Δ/(1000·t)) > random(0,1) then
        Trans := NewTrans;
      if evaluate(Trans) < evaluate(BestTrans) then
        BestTrans := Trans;
    until equilibrium_reached;
    Trans := BestTrans;
    t := t / 1.2;  { cooling rate of 1.2 }
  end
  write Trans;
end
The program is made up of two loops. The outer
loop simply iterates through the temperature
range, beginning with a temperature of 1 and
steadily decreasing it until it gets very close to zero.
The nested loop performs the task of reaching
thermal equilibrium at each temperature. The
first step is to perturb the previous transcription
to make a new one. Notice that the temperature t
is a parameter of the perturb function. Next, the
difference Δ between the old and new evaluations
is calculated. If the new transcription has a better
evaluation than the old one, then Δ is negative.
Next, the program accepts the new transcription
if (i) Δ is negative or (ii) Δ is positive and there
is sufficient free energy in the system to allow the
worse transcription to be accepted. Finally, we
check if the new transcription is better than the
best transcription found so far (BestTrans) and if
so, we set BestTrans to be the new transcription.
Once equilibrium is reached, the current
transcription is set to be the best transcription found
so far, and the search continues.
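The two loops just described can be sketched in Python as below. This is a hedged illustration rather than the implementation reported here: the caller supplies hypothetical `evaluate` and `perturb` functions, the equilibrium test is read as "at most one acceptance among the last eight perturbations at this temperature", and `max_tries` is an added safeguard (not in the source) that bounds the inner loop when nearly every move is accepted.

```python
import math
import random


def temperature_steps(cooling_rate=1.2, t_min=1e-6):
    """Count outer-loop passes as t falls from 1 to t_min or below."""
    t, steps = 1.0, 0
    while t > t_min:
        t /= cooling_rate
        steps += 1
    return steps


def annealing_search(initial, evaluate, perturb, cooling_rate=1.2,
                     t_min=1e-6, scale=1000.0, max_tries=500, rng=random):
    """Return (best_state, best_cost); lower cost is better."""
    current, current_cost = initial, evaluate(initial)
    best, best_cost = current, current_cost
    t = 1.0
    while t > t_min:
        accepted_flags = []
        for _ in range(max_tries):  # assumed cap, not in the source
            candidate = perturb(current, t, rng)
            candidate_cost = evaluate(candidate)
            delta = candidate_cost - current_cost
            accepted = (delta < 0
                        or math.exp(-delta / (scale * t)) > rng.random())
            if accepted:
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            accepted_flags.append(accepted)
            # Equilibrium: at most one acceptance in the last eight tries.
            if len(accepted_flags) >= 8 and sum(accepted_flags[-8:]) <= 1:
                break
        current, current_cost = best, best_cost  # restart from the best
        t /= cooling_rate
    return best, best_cost
```

With a cooling rate of 1.2 and a stopping temperature of 0.000001, `temperature_steps()` gives 76 passes through the outer loop. The `scale` argument plays the role of the factor of 1000 and would need adjusting if the evaluation function had a very different range.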
Performance Results 
Both the genetic and annealing search algorithms
have been implemented in C++. In this section, the
performance of the two implementations is
compared. Performance statistics are based on 1,200
executions of each program. Search parameters were
set so that each execution took around 5 seconds
on a Sun Sparc 10. Three performance trials were
undertaken.

Figure 6: Performance results (no upstep)

Trial 1: Artificial Data. In the first trial, 
both programs generated random sequences of to- 
nes, then computed the corresponding F0 sequence 
using P, then set about transcribing the F0 se- 
quence. Since these sequences were ideal, the best 
possible evaluation for a transcription was zero. 
The performance of the programs could then be 
measured to see how close they came to finding 
the optimal solution. Each program was tested on 
F0 sequences of length 5, 10, 15 and 20. For each 
length, each program transcribed 100 randomly- 
generated sequences. The results are displayed in 
Figure 6. Each pair of bars corresponds to a given 
transcription length. The left member of each pair 
is for the genetic search program, while the right 
member is for the annealing search program. 
The heavily shaded bars corresponding to eva- 
luations less than 1 are the most important. These 
indicate the number of times out of 100 that the 
programs found a transcription with an evalua- 
tion less than 1. This evaluation means that the 
average of the squared difference between the pre- 
dicted F0 values and the actual F0 values was 
less than 1 Hz². Observe that the annealing search
program performs significantly better in all cases. 
Note that the mutation operation in the genetic 
search program treats each bit in the parameter 
encodings equally, while the perturbation opera- 
tion in the annealing search program is sensitive 
to the distinction between more significant vs. less 
significant bits. This may explain the better con- 
vergence behaviour of the annealing search. 
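The evaluation described above, an average of squared differences between predicted and actual F0 values, can be sketched as follows. The prediction function is passed in as a parameter; `toy_predict` is a purely hypothetical stand-in for it, and averaging over the n-1 predicted values (every tone after the first) is an assumption.

```python
def evaluate(tones, f0, predict):
    """Mean squared difference between predicted and actual F0 values."""
    if len(tones) != len(f0) or len(f0) < 2:
        raise ValueError("need two equal-length sequences of two or more items")
    squared = [
        (predict(tones[i - 1], tones[i], f0[i - 1]) - f0[i]) ** 2
        for i in range(1, len(f0))
    ]
    return sum(squared) / len(squared)


def toy_predict(prev_tone, tone, prev_f0):
    """Hypothetical predictor: H raises F0 by 10 Hz, L lowers it by 10 Hz."""
    return prev_f0 + (10.0 if tone == "H" else -10.0)
```

Under this toy predictor an F0 sequence that it reproduces exactly evaluates to zero, which corresponds to the ideal case of the artificial data in trial 1.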
Notice also in Figure 6 that performance
does not degrade with transcription length as the
length doubles from 10 to 20. This is probably
because a randomly generated sequence will contain
downsteps on every second tone (on average),
causing a general downtrend in the F0 values and
severely limiting the combinatorial explosion of
possible transcriptions.

Figure 7: Performance results (upstep)

Figure 8: Performance results for actual data

Trial 2: Artificial Data with Upstep. Trial 
2 was the same as trial 1 except that this time 
upstep was permitted as well. The results are dis- 
played in Figure 7. Again the annealing program 
fares better than the genetic program. Consider 
again the bars corresponding to evaluations less 
than 1. For both programs, however, observe that 
the performance degrades more uniformly than in 
trial 1, probably because the inclusion of upstep 
greatly increases the number of possible transcrip- 
tions (and hence, the number of local optima). 
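To give a rough sense of how much upstep enlarges the search space, the sketch below counts candidate transcriptions. It assumes an inventory of four tones without upstep (H, L and their downstepped forms) and six with upstep; these inventory sizes are an assumption made for illustration, and the count ignores the equivalence of transcriptions that differ only in an initial step.

```python
def count_transcriptions(n, upstep=False):
    """Number of tone sequences of length n under the assumed inventory."""
    inventory = 6 if upstep else 4  # assumed: H, L, plus stepped variants
    return inventory ** n


def growth_factor(n):
    """How many times larger the space becomes once upstep is allowed."""
    return count_transcriptions(n, upstep=True) / count_transcriptions(n)
```

For sequences of 20 tones this gives roughly 1.1 × 10^12 candidates without upstep and 3.7 × 10^15 with it, a growth factor of about 3,300.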
Trial 3: Actual Data. The final trial invol- 
ved real data, including data from the utterance 
given in Figure 1. This trial involved four sub- 
trials. The first and second had F0 sequences of 
length 10, while the third and fourth had length 
18 and 19. The first and second sequences were 
taken by extracting the initial 10 F0 values from 
the third and fourth sequences, thereby avoiding 
the asymptotic behaviour of the longer sequences. 
The data is tabulated below, and it comes from 
the sentences in (5). 
Trial  F0 sequence
1      219,168,183,150,160,136,144,123,131,115
2      205,224,167,200,156,175,136,156,127,140
3      219,168,183,150,160,136,144,123,131,115,
       122,107,113,105,118,100,113,95
4      205,224,167,200,156,175,136,156,127,140,
       118,129,109,119,103,120,102,111,95
Performance results are given in Figure 8. Notice 
that the interpretation of the shading in this figure 
is different from that in previous figures. This is 
because evaluations near zero were less likely with 
real data. In fact, the annealing program never 
found an evaluation less than 3 while the genetic 
program never found an evaluation less than 4. 
Since the programs performed about equally 
on finding transcriptions with an evaluation less 
than 7, I shall display these transcriptions along 
with an indication of how many times each
program found the transcription (G = genetic, 
A = annealing). I give transcriptions which occur- 
red at least twice in one of the programs, during 
100 executions of each. 
Trial 1: Transcriptions                                      G   A
H ↓L ↓H ↓L ↓H ↓L ↓H ↓L ↓H ↓L                                27  37
H ↓L ↓H ↓L ↓H L ↓H ↓L ↓H L                                   7   0
H ↓L ↓H L ↓H L ↓H L ↓H L                                     3   0
H ↓L H ↓L H ↓L H ↓L H ↓L                                    20   2
H L ↓H L ↓H L ↓H L ↓H L                                     24  39

Trial 2: Transcriptions                                      G   A
L ↓H ↓L H ↓L ↓H ↓L ↓H ↓L ↓H                                  5   0
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H                                  66  54

Trial 3: Transcriptions                                      G   A
H ↓L ↓H ↓L ↓H L ↓H ↓L ↓H L ↓H ↓L ↓H L H ↓L H ↓L             11   0
H ↓L ↓H L ↓H L ↓H L ↓H L ↓H ↓L ↓H L H ↓L H ↓L                1   2
H ↓L ↓H L ↓H L ↓H L ↓H L ↓H L ↓H L H ↓L H ↓L                10  14
H L ↓H L ↓H L ↓H L ↓H L ↓H L ↓H L H ↓L H ↓L                 30  56

Trial 4: Transcriptions                                      G   A
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H ↓L ↑H ↓L ↓H ↓L H ↓L ↓H ↓L        60  29
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H L ↓H ↓L ↓H ↓L H ↓L ↓H ↓L          5  19
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H L ↓H ↓L ↓H L H ↓L ↓H ↓L           7   7
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H ↓L ↓H ↓L ↑H L H ↓L ↓H ↓L          0   4
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H L ↓H L ↓H ↓L H ↓L ↓H ↓L           0   3
L ↓H ↓L H ↓L ↓H ↓L H ↓L ↓H L ↓H L ↓H L H ↓L ↓H ↓L            0   6
The results from trial 1 deserve special attention. 
In trial 1, three transcriptions were found by both 
programs. The best evaluations found are given 
below: 
H L ↓H L ↓H L ↓H L ↓H L         E: 3   h: 107   l: 100   d: 0.68
H ↓L H ↓L H ↓L H ↓L H ↓L        E: 4   h: 90    l: 93    d: 0.76
H ↓L ↓H ↓L ↓H ↓L ↓H ↓L ↓H ↓L    E: 3   h: 107   l: 100   d: 0.82
It is striking to note that the first two
transcriptions above are what Hyman and Stewart
(respectively) would have given as transcriptions for
the abstract F0 sequence 1 3 2 4 3 5 4 6 5 7. This is
demonstrated in (7a,b). The third transcription
points to another possibility, given in (7c).
(7) a. Hyman's transcription scheme
       H   L   ↓H  L   ↓H  L   ↓H  L   ↓H  L
       1   3   1   3   1   3   1   3   1   3
       0   0   1   1   2   2   3   3   4   4
       1   3   2   4   3   5   4   6   5   7
    b. Stewart's transcription scheme
       H   ↓L  H   ↓L  H   ↓L  H   ↓L  H   ↓L
       1   2   1   2   1   2   1   2   1   2
       0   1   1   2   2   3   3   4   4   5
       1   3   2   4   3   5   4   6   5   7
    c. Novel transcription scheme
       H   ↓L  ↓H  ↓L  ↓H  ↓L  ↓H  ↓L  ↓H  ↓L
       1   2½  1   2½  1   2½  1   2½  1   2½
       0   ½   1   1½  2   2½  3   3½  4   4½
       1   3   2   4   3   5   4   6   5   7
Therefore, there are encouraging signs that
the program is living up to its promise of
producing alternative, equally acceptable
transcriptions, as desired from an analytical standpoint.
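The arithmetic behind (7) can be checked with a short sketch. It treats each abstract F0 value as the sum of a tone value and a register that grows by a fixed step at every downstep. The tone values for the Hyman and Stewart schemes are read off (7a,b); the values used for the novel scheme (H = 1, L = 2½, step = ½) are inferred from the requirement that the sums match the sequence 1 3 2 4 3 5 4 6 5 7, so they should be taken as an assumption.

```python
from fractions import Fraction


def abstract_f0(transcription, high, low, step):
    """Sum tone value and register; each downstep raises the register."""
    register = Fraction(0)
    values = []
    for token in transcription.split():
        if token.startswith("↓"):
            register += step
        tone = token[-1]  # "H" or "L"
        values.append((high if tone == "H" else low) + register)
    return values


TARGET = [1, 3, 2, 4, 3, 5, 4, 6, 5, 7]
hyman = abstract_f0("H L ↓H L ↓H L ↓H L ↓H L", 1, 3, 1)
stewart = abstract_f0("H ↓L H ↓L H ↓L H ↓L H ↓L", 1, 2, 1)
novel = abstract_f0("H ↓L ↓H ↓L ↓H ↓L ↓H ↓L ↓H ↓L",
                    1, Fraction(5, 2), Fraction(1, 2))
```

All three lists compare equal to `TARGET`, which is the sense in which the three transcriptions are equally acceptable accounts of the same abstract sequence.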
Multiple Solutions 
Although we have seen more than one transcription
for a given F0 sequence, it is inconvenient to
be required to run the programs several times in
order to see if more than one solution can be
found. Furthermore, the programs are designed not
to get caught in local optima, which is a problem
since interesting alternative transcriptions may
actually be local optima. Therefore, both programs
are set up to report the k best solutions, where the
user specifies the number of solutions desired. The
program ensures that the same area of the search
space is not re-explored by subsequent searches.
This is done by defining a distance metric on
transcriptions which counts the number of tones in one
transcription that have to be changed in order to
make it identical to the other transcription. That
part of the search space within a distance of n/3
from any previously found solution is not explored
again. The programs give up before finding k
solutions if 5 randomly generated transcriptions all
fall within distance n/3 of previous solutions.
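The distance metric and the exclusion rule can be sketched as follows. The function names and the default tone inventory are hypothetical, and treating "within a distance of n/3" as less than or equal to n/3 is an assumption, since the text does not say whether the boundary is included.

```python
import random


def distance(a, b):
    """Number of tones in a that must change to make it identical to b."""
    if len(a) != len(b):
        raise ValueError("transcriptions must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_excluded(candidate, solutions):
    """True if candidate lies within n/3 of any previously found solution."""
    n = len(candidate)
    return any(distance(candidate, s) <= n / 3 for s in solutions)


def random_start(n, solutions, tones=("H", "↓H", "L", "↓L"),
                 attempts=5, rng=random):
    """Return a fresh starting transcription, or None to give up."""
    for _ in range(attempts):
        candidate = [rng.choice(tones) for _ in range(n)]
        if not is_excluded(candidate, solutions):
            return candidate
    return None  # all attempts fell too close to earlier solutions
```

Returning `None` after five failed attempts mirrors the rule that the programs give up before finding k solutions.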
Now, consider the following randomly
generated sequence of tones:
201 215 201 173 163 201 173      d: 0.87
The annealing program was set the task of
finding ten transcriptions of this tone sequence. The
program was run only twice, and it reported the
following solutions with evaluations less than or
equal to 1. Both runs of the program found
the same solutions, and in the same order. (Note
that two transcriptions are taken to be the same if
one or both begin with an initial upstep or
downstep; this has no effect on the phonetic
interpretation.) In the following displays, the predicted F0
values are given below each solution to facilitate
comparison with the input sequence.
First run:
↓H  ↑H  ↓H  L   ↓L  ↑H  L      h: 101  l: 92   d: 0.88  E: 0.20
201 215 201 172 163 201 172
↓H  ↑H  ↓H  ↑L  ↓L  H   ↑L     h: 109  l: 94   d: 0.87  E: 0.23
201 215 201 174 163 201 174
L   ↓H  ↓H  L   ↓L  ↑H  L      h: 105  l: 97   d: 0.86  E: 1.00
201 217 201 174 163 201 174

Second run:
H   ↑H  ↓H  L   ↓L  ↑H  L      h: 110  l: 100  d: 0.88  E: 0.86
201 214 201 173 164 201 173
↓H  ↑H  ↓H  ↑L  ↓L  H   ↑L     h: 102  l: 88   d: 0.88  E: 0.66
201 215 201 174 164 201 174
↓L  ↓H  ↓H  L   ↓L  ↑H  L      h: 104  l: 96   d: 0.86  E: 1.00
201 217 201 174 163 201 174
Since all executions to this point have been
based on the first table of R values, it was decided
to try a test with the second table of R values to
see if the performance was different. Interestingly,
the third solution in both of the above
executions was not found, though two new solutions were
found.
↑H  ↑H  ↓H  L   ↓L  ↑H  L      h: 94   l: 80   d: 0.88  E: 0.49
201 216 201 173 162 201 173
L   ↓H  ↓H  L   ↓L  ↑H  L      h: 97   l: 84   d: 0.88  E: 0.65
201 215 201 174 163 201 174
↓H  ↑H  ↓H  ↑L  ↓L  H   ↑L     h: 100  l: 81   d: 0.88  E: 0.92
201 215 201 174 163 201 174
↓L  H   L   ↓H  L   ↑L  ↓H     h: 92   l: 86   d: 0.67  E: 0.48
201 214 201 173 163 201 173

↓H  ↑H  ↓H  L   ↓L  ↑H  L      h: 107  l: 92   d: 0.87  E: 0.40
201 216 201 173 162 201 173
L   H   L   ↓H  L   ↑L  ↓H     h: 99   l: 93   d: 0.65  E: 0.82
201 214 201 173 163 201 173
↓L  ↓H  ↓H  L   ↓L  ↑H  L      h: 90   l: 78   d: 0.88  E: 0.86
201 217 202 174 163 202 174
Observe that the value of d in the above
solutions clusters around 0.66 and 0.87. Similar
clustering may be occurring with the ratio h/l.
However, an analysis of the relationship between the
kinds of solutions found, the two R tables and the
parameter values h, l and d has not been
attempted.
Areas for Further Improvement 
It is rather unsatisfying that the performance of
the two programs is heavily dependent on the
setting of several search parameters, and it seems to
be a combinatorial optimisation problem in itself
to find good parameter settings. My trial-and-error
approach will not necessarily have found
optimal parameter values, and so it would be
premature to conclude from the performance
comparison that annealing search is better than genetic
search for the problem of tone transcription. A
more thoroughgoing comparison of these two
approaches to the problem needs to be undertaken.
Since the parameters are continuous variables,
and since the evaluation function, which we could
write as $\mathcal{E}_{T,X}(h,l,d)$, is a smoothly
continuous function in h, l, d, it would be worthwhile
to try other (deterministic) search methods for
optimising h, l and d, once a candidate tone
transcription T has been found.
Finally, it would be interesting to integrate a
system like either of the ones presented here into a
speech workstation. As the phonologist identifies
salient points with a cursor the system would do
the transcription, incrementally and interactively.
CONCLUSION
This paper began with a discussion of the problem
of relating tone transcriptions to their physical
counterparts, namely F0 traces. I showed
that it is desirable for phonologists working on
tone to use sequences of F0 values as their
primary data, rather than impressionistic
transcriptions which make (usually implicit) assumptions
about F0 scaling. I provided an F0 prediction
function P which estimated the F0 value of a tone,
given the F0 value of the previous tone and the
identities of the two tones. I presented
instrumental data from Bamileke Dschang and showed
how the function could be specialised for this
language. The function was then incorporated into
the evaluation functions of two implemented
non-deterministic search algorithms. The performance
results were encouraging and demonstrate the
promise of automated tone transcription.
ACKNOWLEDGEMENTS 
This research is funded by the UK Economic
and Social Research Council, under grant R00023
4439 A Computational Model for the Phonology-
Phonetics Interface in Tone Languages. I am
indebted to SIL Cameroon for their logistical
support on my field trip in September and October of
1993, during which the data presented in the
paper (and much other data besides) was gathered,
and especially to Nancy Haynes and Gretchen Harro
for helping me collect the data and Jean-Claude
Gnintedem who endured many recording sessions.
I am grateful to John Coleman, Michael Gasser
and Marie South for helpful comments on an
earlier version of this paper. The F0 data was
extracted using the ESPS Waves+ package in the
Edinburgh University Phonetics Laboratory.

References 
Bird, S. & Stegen, O. (1993). Tone in the Bamileke
Dschang Associative Construction: An
Electrolaryngographic Study and Comparison
with Hyman (1985). RP 57, University of
Edinburgh, Centre for Cognitive Science.
Clements, G. N. (1979). The description of 
terraced-level tone languages. Language, 55, 
536-558. 
Connell, B. & Ladd, D. R. (1990). Aspects of pitch 
realisation in Yoruba. Phonology, 7, 1-29. 
Ellison, T. M. (1993). Machine Learning of
Phonological Structure. PhD thesis, University of
Western Australia.
Goldberg, D. E. (1989). Genetic Algorithms in
Search, Optimization, and Machine Learning.
Addison-Wesley.
Huang, X. D., Ariki, Y., & Jack, M. (1990).
Hidden Markov Models for Speech Recognition.
Edinburgh Information Technology Series.
Edinburgh University Press.
Hyman, L. M. (1979). A reanalysis of tonal
downstep. Journal of African Languages and
Linguistics, 1, 9-29.
Hyman, L. M. (1985). Word domains and
downstep in Bamileke-Dschang. Phonology
Yearbook, 2, 45-83.
Hyman, L. M. & Schuh, R. G. (1974).
Universals of tone rules: evidence from West Africa.
Linguistic Inquiry, 5, 81-115.
Liberman, M., Schultz, J. M., Hong, S., & Okeke,
V. (1993). The Phonetic Interpretation of
Tone in Igbo. Phonetica, 50, 147-160.
Pierrehumbert, J. & Beckman, M. (1988).
Japanese Tone Structure. Cambridge Mass.: MIT
Press.
South, M. C., Wetherill, G. B., & Tham, M. T. 
(1993). Hitch-hiker's guide to genetic algo- 
rithms. Journal of Applied Statistics, 20, 153- 
175. 
Stewart, J. M. (1993). Dschang and Ebrié as
Akan-type total downstep languages. In H. van der
Hulst & K. Snider (Eds.), The Phonology of
Tone - The Representation of Tonal Register
(pp. 185-244). Berlin; New York: Mouton de
Gruyter. Linguistic models, Volume 17.
van Laarhoven, P. J. M. & Aarts, E. H. L. (1987).
Simulated Annealing. Dordrecht: Reidel.
