AUTOMATED QUALITY MONITORING FOR CALL CENTERS USING SPEECH AND NLP TECHNOLOGIES

G. Zweig, O. Siohan, G. Saon, B. Ramabhadran, D. Povey, L. Mangu and B. Kingsbury

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
ABSTRACT
This paper describes an automated system for assigning quality scores to recorded call center conversations. The system combines speech recognition, pattern matching, and maximum entropy classification to rank calls according to their measured quality. Calls at both ends of the spectrum are flagged as “interesting” and made available for further human monitoring. In this process, the ASR transcript is used to answer a set of standard quality control questions such as “did the agent use courteous words and phrases,” and to generate a question-based score. This is interpolated with the probability of a call being “bad,” as determined by maximum entropy operating on a set of ASR-derived features such as “maximum silence length” and the occurrence of selected n-gram word sequences. The system is trained on a set of calls with associated manual evaluation forms. We present precision and recall results from IBM’s North American Help Desk indicating that for a given amount of listening effort, this system triples the number of bad calls that are identified, compared with the current policy of randomly sampling calls. The application that will be demonstrated is a research prototype that was built in conjunction with IBM’s North American call centers.
1. INTRODUCTION
Every day, tens of millions of help-desk calls are recorded at call centers around the world. As part of a typical call center operation, a random sample of these calls is re-played to human monitors who score the calls with respect to a variety of quality-related questions, e.g.

• Was the account successfully identified by the agent?
• Did the agent request error codes/messages to help determine the problem?
• Was the problem resolved?
• Did the agent maintain appropriate tone, pitch, volume and pace?
This process suffers from a number of important problems. First, the monitoring at least doubles the cost of each call (first an operator is paid to take it, then a monitor to evaluate it). This leads to the second problem: because of the expense, only a very small sample of calls, e.g. a fraction of a percent, is typically evaluated. The third problem arises from the fact that most calls are ordinary and uninteresting; with random sampling, the human monitors spend most of their time listening to uninteresting calls.
This work describes an automated quality-monitoring system that addresses these problems. Automatic speech recognition is used to transcribe 100% of the calls coming in to a call center, and default quality scores are assigned based on features such as key-words, key-phrases, the number and type of hesitations, and the average silence durations. The default score is used to rank the calls from worst to best, and this sorted list is made available to the human evaluators, who can thus spend their time listening only to calls for which there is some a priori reason to expect something interesting.
The automatic quality-monitoring problem is interesting in part because of the variability in how hard it is to answer the questions. Some questions, for example, “Did the agent use courteous words and phrases?” are relatively straightforward to answer by looking for key words and phrases. Others, however, require essentially human-level knowledge to answer; for example, one company’s monitors are asked to answer the question “Did the agent take ownership of the problem?” Our work focuses on calls from IBM’s North American call centers, where there is a set of 31 questions that are used to evaluate call quality. Because of the high degree of variability found in these calls, we have investigated two approaches:

1. Use a partial score based only on the subset of questions that can be reliably answered.
2. Use a maximum entropy classifier to map directly from ASR-generated features to the probability that a call is bad (defined as belonging to the bottom 20% of calls).
We have found that both approaches are workable, and we present final results based on an interpolation between the two scores. These results indicate that for a fixed amount of listening effort, the number of bad calls that are identified approximately triples with our call-ranking approach. Surprisingly, while there has been significant previous scholarly research in automated call routing and classification in the call center, e.g. [1, 2, 3, 4, 5], there has been much less work on automated quality monitoring per se.
2. ASR FOR CALL CENTER TRANSCRIPTION
2.1. Data
The speech recognition systems were trained on approximately 300 hours of 6kHz, mono audio data collected at one of the IBM call centers located in Raleigh, NC. The audio was manually transcribed, and speaker turns were explicitly marked in the word transcriptions but not the corresponding times. In order to detect speaker changes in the training data, we did a forced alignment of the data and chopped it at speaker boundaries. The test set consists of 50 calls with 113 speakers, totaling about 3 hours of speech.
2.2. Speaker Independent System
The raw acoustic features used for segmentation and recognition are perceptual linear prediction (PLP) features.
Segmentation/clustering | Adaptation    | WER
Manual                  | Off-line      | 30.2%
Manual                  | Incremental   | 31.3%
Manual                  | No adaptation | 35.9%
Automatic               | Off-line      | 33.0%
Automatic               | Incremental   | 35.1%

Table 1. ASR results depending on segmentation/clustering and adaptation type.
Accuracy | Top 20% | Bottom 20%
Random   | 20%     | 20%
QA       | 41%     | 30%

Table 2. Accuracy for the Question Answering system.
The features are mean-normalized 40-dimensional LDA+MLLT features. The SI acoustic model consists of 50K Gaussians trained with MPE and uses a quinphone cross-word acoustic context. The techniques are the same as those described in [6].
2.3. Incremental Speaker Adaptation
In the context of speaker-adaptive training, we use two forms of feature-space normalization, vocal tract length normalization (VTLN) and feature-space MLLR (fMLLR, also known as constrained MLLR), to produce canonical acoustic models in which some of the non-linguistic sources of speech variability have been reduced. To this canonical feature space, we then apply a discriminatively trained transform called fMPE [7]. The speaker-adapted recognition model is trained in this resulting feature space using MPE.

We distinguish between two forms of adaptation: off-line and incremental adaptation. For the former, the transformations are computed per conversation side using the full output of a speaker-independent system. For the latter, the transformations are updated incrementally using the decoded output of the speaker-adapted system up to the current time; the speaker-adaptive transforms are then applied to the future sentences. The advantage of incremental adaptation is that it requires only a single decoding pass (as opposed to two passes for off-line adaptation), resulting in a decoding process which is twice as fast. In Table 1, we compare the performance of the two approaches. Most of the gain of full off-line adaptation is retained in the incremental version.
2.3.1. Segmentation and Speaker Clustering
We use an HMM-based segmentation procedure to segment the audio into speech and non-speech prior to decoding. The reason is that we want to eliminate the non-speech segments in order to reduce the computational load during recognition. The speech segments are clustered together in order to identify segments coming from the same speaker, which is crucial for speaker adaptation. The clustering is done via k-means, with each segment modeled by a single diagonal-covariance Gaussian. The metric is the symmetric K-L divergence between two Gaussians. The impact of the automatic segmentation and clustering on the error rate is indicated in Table 1.
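As an illustration of this metric and clustering loop, the following is a minimal sketch, assuming each segment arrives as a frames-by-dimensions feature matrix; the function names, variance floor, and re-estimation scheme are illustrative choices, not the implementation used in the system.

```python
import numpy as np

def sym_kl_diag_gauss(mu1, var1, mu2, var2):
    """Symmetric K-L divergence between two diagonal-covariance Gaussians."""
    d2 = (mu1 - mu2) ** 2
    kl_pq = 0.5 * np.sum(np.log(var2 / var1) + (var1 + d2) / var2 - 1.0)
    kl_qp = 0.5 * np.sum(np.log(var1 / var2) + (var2 + d2) / var1 - 1.0)
    return kl_pq + kl_qp

def cluster_segments(segments, k, n_iter=20, seed=0):
    """K-means-style clustering of speech segments, each modeled by a
    single diagonal-covariance Gaussian over its acoustic features."""
    rng = np.random.default_rng(seed)
    # Model each segment by its feature mean and (floored) diagonal variance.
    stats = [(s.mean(axis=0), s.var(axis=0) + 1e-6) for s in segments]
    # Initialize centroids from randomly chosen segments.
    centroids = [stats[i] for i in rng.choice(len(stats), size=k, replace=False)]
    assign = np.zeros(len(stats), dtype=int)
    for _ in range(n_iter):
        # Assign each segment to the nearest centroid under symmetric K-L.
        for i, (mu, var) in enumerate(stats):
            assign[i] = int(np.argmin(
                [sym_kl_diag_gauss(mu, var, cm, cv) for cm, cv in centroids]))
        # Re-estimate each centroid from the frames of its member segments.
        for j in range(k):
            members = [s for s, a in zip(segments, assign) if a == j]
            if members:
                pooled = np.vstack(members)
                centroids[j] = (pooled.mean(axis=0), pooled.var(axis=0) + 1e-6)
    return assign
```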
Accuracy | Top 20% | Bottom 20%
Random   | 20%     | 20%
ME       | 49%     | 36%

Table 3. Accuracy for the Maximum Entropy system.

Accuracy | Top 20% | Bottom 20%
Random   | 20%     | 20%
ME + QA  | 53%     | 44%

Table 4. Accuracy for the combined system.
3. CALL RANKING

3.1. Question Answering
This section presents automated techniques for evaluating call quality. These techniques were developed using a training/development set of 676 calls with associated manually generated quality evaluations. The test set consists of 195 calls.

The quality of the service provided by the help-desk representatives is commonly assessed by having human monitors listen to a random sample of the calls and then fill in evaluation forms. The form for IBM’s North American Help Desk contains 31 questions. A subset of the questions can be answered easily using automatic methods, among them the ones that check that the agent followed the guidelines, e.g.

• Did the agent follow the appropriate closing script?
• Did the agent identify herself to the customer?

But some of the questions require human-level knowledge of the world to answer, e.g.

• Did the agent ask pertinent questions to gain clarity of the problem?
• Were all available resources used to solve the problem?
We were able to answer 21 of the 31 questions using pattern matching techniques. For example, if the question is “Did the agent follow the appropriate closing script?”, we search for “THANK YOU FOR CALLING”, “ANYTHING ELSE” and “SERVICE REQUEST”. Any of these is a good partial match for the full script, “Thank you for calling, is there anything else I can help you with before closing this service request?” Based on the answers to the 21 questions, we compute a score for each call and use it to rank the calls. We label a call in the test set as bad/good if it has been placed in the bottom/top 20% by the human evaluators. We report the accuracy of our scoring system on the test set by computing the number of bad calls that occur in the bottom 20% of our sorted list and the number of good calls found in the top 20% of our list. The accuracy numbers can be found in Table 2.
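A minimal sketch of this style of pattern matching follows. Only the closing-script patterns come from the example above; the “identification” key and its patterns are hypothetical placeholders, and scoring a call by the fraction of satisfied questions is an assumption, since the paper does not give the exact scoring formula.

```python
# Map each automatically answerable question to its search patterns.
QUESTION_PATTERNS = {
    "closing_script": ["THANK YOU FOR CALLING", "ANYTHING ELSE", "SERVICE REQUEST"],
    "identification": ["MY NAME IS", "THIS IS"],  # hypothetical
}

def answer_questions(transcript: str) -> dict:
    """Answer each question: True if any of its patterns occurs in the ASR output."""
    text = transcript.upper()
    return {q: any(p in text for p in pats) for q, pats in QUESTION_PATTERNS.items()}

def qa_score(transcript: str) -> float:
    """Score a call as the fraction of pattern-matchable questions answered 'yes'."""
    answers = answer_questions(transcript)
    return sum(answers.values()) / len(answers)
```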
3.2. Maximum Entropy Ranking
Another alternative for scoring calls is to find arbitrary features in the speech recognition output that correlate with the outcome of a call being in the bottom 20% or not. The goal is to estimate the probability of a call being bad based on features extracted from the automatic transcription.
Fig. 1. Display of selected calls.
To achieve this, we build a maximum entropy based system which is trained on a set of calls with associated transcriptions and manual evaluations. The following equation is used to determine the score of a call $C$ using a set of $N$ predefined features:

$$P(\mathrm{class} \mid C) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} \lambda_i f_i(\mathrm{class}, C) \right) \qquad (1)$$

where $\mathrm{class} \in \{\mathrm{bad}, \mathrm{not\text{-}bad}\}$, $Z$ is a normalizing factor, the $f_i(\cdot)$ are indicator functions, and $\{\lambda_i\}_{i=1,\dots,N}$ are the parameters of the model, estimated via iterative scaling [8].
Because our training set contained under 700 calls, we used a hand-guided method for defining features. Specifically, we generated a list of VIP phrases as candidate features, e.g. “THANK YOU FOR CALLING” and “HELP YOU”. We also created a pool of generic ASR features, e.g. “number of hesitations”, “total silence duration”, and “longest silence duration”. A decision tree was then used to select the most relevant features and the threshold associated with each feature. The final set of features contained 5 generic features and 25 VIP phrases. Inspection of the weights learned for the different features shows, for example, that a call with many hesitations and long silences is most likely a bad call.
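The following is a minimal sketch of this kind of feature extraction and model. The phrase list and thresholds are placeholders (the real 25 phrases and 5 thresholds were chosen by the decision tree), and scikit-learn’s logistic regression is used as the maximum entropy model, whereas the paper trained its weights by iterative scaling [8].

```python
from sklearn.linear_model import LogisticRegression

# Illustrative VIP phrases and generic-feature thresholds; placeholders only.
VIP_PHRASES = ["THANK YOU FOR CALLING", "HELP YOU"]
HESITATIONS = {"UH", "UM", "ER"}

def call_features(words, silences):
    """Binary indicator features f_i(C), in the spirit of Equation 1."""
    text = " ".join(words)
    feats = [1.0 if p in text else 0.0 for p in VIP_PHRASES]
    feats.append(1.0 if sum(w in HESITATIONS for w in words) > 10 else 0.0)
    feats.append(1.0 if sum(silences) > 60.0 else 0.0)              # total silence
    feats.append(1.0 if max(silences, default=0.0) > 10.0 else 0.0)  # longest silence
    return feats

# Two-class logistic regression is a maximum entropy model over
# {bad, not-bad}; training data would be the calls with manual
# evaluations, with the bottom-20% calls labeled "bad".
model = LogisticRegression()
# model.fit([call_features(w, s) for w, s in train_calls], train_labels)
# p_bad = model.predict_proba([call_features(words, silences)])[0][1]
```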
We use P(bad | C), as given by Equation 1, to rank all the calls. Table 3 shows the accuracy of this system for the bottom and top 20% of the test calls.

At this point we have two scoring mechanisms for each call: one that relies on answering a fixed number of evaluation questions, and a more global one that looks across the entire call for hints. Both scores lie between 0 and 1, and they can therefore be interpolated to generate one unique score. After optimizing the interpolation weights on a held-out set, we obtained a slightly higher weight (0.6) for the maximum entropy model. As Table 4 shows, the accuracy of the combined system is greater than the accuracy of either individual system, suggesting the complementarity of the two initial systems.
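A rough sketch of this interpolation follows; since the paper only states that both scores lie in [0, 1], the assumption that the QA score must be flipped onto a common “badness” scale is ours.

```python
def combined_score(p_bad_maxent: float, qa_score: float, w: float = 0.6) -> float:
    """Interpolate the two per-call scores on a common 'badness' scale;
    w = 0.6 is the held-out weight reported for the maximum entropy model."""
    return w * p_bad_maxent + (1.0 - w) * (1.0 - qa_score)

# Rank worst-first, so human monitors see the likely-bad calls at the top:
# calls.sort(key=lambda c: combined_score(c.p_bad, c.qa_score), reverse=True)
```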
4. END-TO-END SYSTEM PERFORMANCE

4.1. Application

This section describes the user interface of the automated quality monitoring application.
Fig. 2. Interface to listen to audio and update the evaluation form.
As explained in Section 1, the evaluator scores calls with respect to a set of quality-related questions after listening to the calls. To aid this process, the user interface provides an efficient mechanism for the human evaluator to select calls, e.g.

• All calls from a specific agent, sorted by score
• The top 20% or the bottom 20% of the calls from a specific agent, ranked by score
• The top 20% or the bottom 20% of all calls from all agents

The automated quality monitoring user interface is a J2EE web application that is supported by back-end databases and content management systems.¹ The displayed list of calls provides a link to the audio, the automatically filled evaluation form, the overall score for the call, the agent’s name, the server location, the call id, and the date and duration of the call (see Figure 1). This interface gives the evaluator the ability to listen to interesting calls and update the answers in the evaluation form if necessary (the audio player and evaluation form are illustrated in Figure 2). In addition, the interface provides the evaluator with the ability to view summary statistics (average score) and additional information about the quality of the calls. The overall system is designed to automatically download calls from multiple locations on a daily basis, then transcribe and index them, thereby making them available to the supervisors for monitoring. Calls spanning a month are available at any given time for monitoring purposes.
4.2. Precision and Recall

This section presents precision and recall numbers for the identification of “bad” calls. The test set consists of 195 calls that were manually evaluated by call center personnel. Based on these manual scores, the calls were ordered by quality, and the bottom 20% were deemed to be “bad.” To retrieve calls for monitoring, we sort the calls based on the automatically assigned quality score and return the worst. In our summary figures, precision and recall are plotted as a function of the number of calls that are selected for monitoring. This is important because in reality only a small number of calls can receive human attention.
¹ In our case, the back end consists of DB2 and IBM’s Websphere Information Integrator for Content, and the application is hosted on Websphere 5.1.
Fig. 3. Precision for the bottom 20% of the calls as a function of the number of calls retrieved. (Curves: Observed, Ideal, Random.)
Fig. 4. Recall for the bottom 20% of the calls. (Curves: Observed, Ideal, Random.)
Precision is the ratio of bad calls retrieved to the total number of calls monitored, and recall is the ratio of the number of bad calls retrieved to the total number of bad calls in the test set. Three curves are shown in each plot: the actually observed performance, the performance of random selection, and the oracle or ideal performance. Oracle performance shows what would happen if a perfect automatic ordering of the calls were achieved.
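These quantities reduce to a few lines of code; the following sketch (with hypothetical argument names) computes both as a function of the number of calls monitored.

```python
def precision_recall_at_k(ranked_ids, bad_ids, k):
    """Precision and recall when the k worst-ranked calls are monitored."""
    retrieved = set(ranked_ids[:k])
    hits = len(retrieved & set(bad_ids))
    return hits / k, hits / len(bad_ids)

# Example: with 195 test calls and the bottom 20% (39 calls) labeled "bad",
# sweeping k from 1 to 195 traces out curves like those in Figures 3 and 4.
```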
Figure 3 shows the precision performance. We see that in the monitoring regime where only a small fraction of the calls are monitored, we achieve over 60% precision. (Further, if 20% of the calls are monitored, we still attain over 40% precision.)

Figure 4 shows the recall performance. In the regime of low-volume monitoring, the recall is midway between what could be achieved with an oracle and the performance of random selection.

Figure 5 shows the ratio of the number of bad calls found with our automated ranking to the number found with random selection. This indicates that in the low-monitoring regime, our automated technique triples efficiency.
4.3. Human vs. Computer Rankings

As a final measure of performance, in Figure 6 we present a scatter plot comparing the human and computer rankings. We do not have calls that were scored by two humans, so we cannot present a human-human scatter plot for comparison.
5. CONCLUSION

This paper has presented an automated system for quality monitoring in the call center. We propose a combination of maximum entropy classification based on ASR-derived features, and question answering based on simple pattern matching. The system can either be used to replace human monitors, or to make them more efficient.
Fig. 5. Ratio of bad calls found with QTM to random selection, as a function of the number of bad calls retrieved. (Curves: Observed, Ideal.)
Fig. 6. Scatter plot of human vs. computer rank.
Our results show that we can triple the efficiency of human monitors, in the sense of identifying three times as many bad calls for the same amount of listening effort.
6. REFERENCES

[1] J. Chu-Carroll and B. Carpenter, “Vector-based natural language call routing,” Computational Linguistics, 1999.

[2] P. Haffner, G. Tur, and J. Wright, “Optimizing SVMs for complex call classification,” 2003.

[3] M. Tang, B. Pellom, and K. Hacioglu, “Call-type classification and unsupervised training for the call center domain,” in ASRU-2003, 2003.

[4] D. Hakkani-Tur, G. Tur, M. Rahim, and G. Riccardi, “Unsupervised and active learning in automatic speech recognition for call classification,” in ICASSP-04, 2004.

[5] C. Wu, J. Kuo, E. E. Jan, V. Goel, and D. Lubensky, “Improving end-to-end performance of call classification through data confusion reduction and model tolerance enhancement,” in Interspeech-05, 2005.

[6] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig, “The IBM 2004 conversational telephony system for rich transcription,” in Eurospeech-2005, 2005.

[7] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Discriminatively trained features for speech recognition,” in ICASSP-2005, 2005.

[8] A. Berger, S. Della Pietra, and V. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, no. 1, 1996.
