Improving Information Extraction by Modeling
Errors in Speech Recognizer Output
David D. Palmer
The MITRE Corporation
202 Burlington Road
Bedford, MA 01730
palmer@mitre.org
Mari Ostendorf
Electrical Engineering Dept.
University of Washington
Seattle, WA 98195
mo@ee.washington.edu
ABSTRACT
In this paper we describe a technique for improving the
performance of an information extraction system for speech
data by explicitly modeling the errors in the recognizer
output. The approach combines a statistical model of named
entity states with a lattice representation of hypothesized
words and errors annotated with recognition confidence
scores. Additional refinements include the use of multiple
error types, improved confidence estimation, and multi-
pass processing. In combination, these techniques im-
prove named entity recognition performance over a text-
based baseline by 28%.
Keywords
ASR error modeling, information extraction, word confi-
dence
1. INTRODUCTION
There has been a great deal of research on applying nat-
ural language processing (NLP) techniques to text-based
sources of written language data, such as newspaper and
newswire data. Most NLP approaches to spoken language
data, such as broadcast news and telephone conversations,
have consisted of applying text-based systems to the out-
put of an automatic speech recognition (ASR) system; re-
search on improving these approaches has focused on ei-
ther improving the ASR accuracy or improving the text-
based system (or both). However, applying text-based sys-
tems to ASR output ignores the fact that there are funda-
mental differences between written texts and ASR tran-
scriptions of spoken language: the style is different be-
tween written and spoken language, the transcription con-
ventions are different, and, most importantly, there are er-
rors in ASR transcriptions. In this work, we focus on the
third problem: handling errors by explicitly modeling un-
certainty in ASR transcriptions.
The idea of explicit error handling in information ex-
traction (IE) from spoken documents was introduced by
Grishman in [1], where a channel model of word insertions
and deletions was used with a deterministic pattern match-
ing system for information extraction. While the use of an
error model resulted in substantial performance improve-
ments, the overall performance was still quite low, perhaps
because the original system was designed to take advan-
tage of orthographic features. In looking ahead, Grishman
suggests that a probabilistic approach might be more suc-
cessful at handling errors.
The work described here provides such an approach, but
introduces an acoustically-driven word confidence score
rather than the word-based channel model proposed in [1].
More specifically, we provide a unified approach to pre-
dicting and using uncertainty in processing spoken lan-
guage data, focusing on the specific IE task of identifying
named entities (NEs). We show that by explicitly mod-
eling multiple types of errors in the ASR output, we can
improve the performance of an IE system, which benefits
further from improved error prediction using new features
derived from multi-pass processing.
The rest of the paper is organized as follows. In Sec-
tion 2 we describe our error modeling, including explicit
modeling of multiple ASR error types. New features for
word confidence estimation and the resulting performance
improvements are described in Section 3. Experimental results
for NE recognition are presented in Section 4 using Broad-
cast News speech data. Finally, in Section 5, we summa-
rize the key findings and implications for future work.
2. APPROACH
Our approach to error handling in information extrac-
tion involves using probabilistic models for both informa-
tion extraction and the ASR error process. The component
models and an integrated search strategy are described in
this section.
2.1 Statistical IE
We use a probabilistic IE system that relates a word sequence
$W = w_1, \ldots, w_N$ to a sequence of information states
$S = s_1, \ldots, s_N$ that provides a simple parse of the
word sequence into phrases, such as name phrases. For
the work described here, the states $s_t$ correspond to dif-
ferent types of NEs. The IE model is essentially a phrase
language model:

$$P(S, W) = P(s_1, \ldots, s_N, w_1, \ldots, w_N) \qquad (1)$$
$$= \prod_{t=1}^{N} P(w_t \mid w_{t-1}, s_t)\, P(s_t \mid s_{t-1}, w_{t-1})$$

with state-dependent bigrams $P(w_t \mid w_{t-1}, s_t)$ that model
the types of words associated with a specific type of NE,
and state transition probabilities $P(s_t \mid s_{t-1}, w_{t-1})$ that mix
the Markov-like structure of an HMM with dependence
on the previous word. (Note that titles, such as “Pres-
ident” and “Mr.”, are good indicators of a transition to a
name state.)
This IE model, described further in [2], is similar to
other statistical approaches [3, 4] in the use of state depen-
dent bigrams, but uses a different smoothing mechanism
and state topology. In addition, a key difference in our
work is explicit error modeling in the “word” sequence, as
described next.
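As a concrete illustration, the factored probability in Equation 1 can be sketched as follows. The probability tables and the backoff floor here are toy assumptions for illustration, not the trained, smoothed model of [2]:

```python
# Sketch of the phrase language model in Equation 1: a product of
# state-dependent word bigrams and word-conditioned state transitions.
# The tables below are illustrative stand-ins, not trained values.

def sequence_prob(words, states, word_lm, trans_lm):
    """P(S, W) = prod_t P(w_t | w_{t-1}, s_t) * P(s_t | s_{t-1}, w_{t-1})."""
    prob = 1.0
    prev_word, prev_state = "<s>", "<s>"
    for w, s in zip(words, states):
        prob *= word_lm.get((w, prev_word, s), 1e-6)    # state-dependent bigram
        prob *= trans_lm.get((s, prev_state, prev_word), 1e-6)  # transition
        prev_word, prev_state = w, s
    return prob

# Toy tables: a title word like "president" signals a transition to a name state.
word_lm = {("president", "<s>", "OTHER"): 0.1,
           ("clinton", "president", "PERSON"): 0.5}
trans_lm = {("OTHER", "<s>", "<s>"): 0.9,
            ("PERSON", "OTHER", "president"): 0.6}

p = sequence_prob(["president", "clinton"], ["OTHER", "PERSON"],
                  word_lm, trans_lm)
```

With these toy numbers, `p` is 0.1 × 0.9 × 0.5 × 0.6 = 0.027; the real system additionally smooths both tables, as noted above.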
2.2 Error Modeling
To explicitly model errors in the IE system, we intro-
duce new notation for the hypothesized word sequence,
$H = h_1, \ldots, h_N$, which may differ from the actual word
sequence $W$, and a sequence of error indicator variables
$E = e_1, \ldots, e_N$, where $e_t = 1$ when $h_t$ is an error and
$e_t = 0$ when $h_t$ is correct. We assume that the hypothe-
sized words from the recognizer are each annotated with
confidence scores

$$c_t = P(e_t = 0 \mid H, A) = P(h_t = w_t \mid H, A),$$

where $A$ represents the set of features available for ini-
tial confidence estimation from the recognizer, acoustic or
otherwise.
[Figure 1: Lattice with correct and error paths: each hypothesized word $h_t$ between nodes $t-1$ and $t$ has a parallel $\epsilon$ error arc.]
We construct a simple lattice from $h_1, \ldots, h_N$ with
“error” arcs indicated by $\epsilon$-tokens in parallel with each hy-
pothesized word $h_t$, as illustrated in Figure 1. We then
find the maximum posterior probability state sequence by
summing over all paths through the lattice:

$$S^* = \arg\max_S P(S \mid H, A) \qquad (2)$$
$$= \arg\max_S \sum_E P(S, E \mid H, A) \qquad (3)$$

or, equivalently, by marginalizing over the sequence $E$. Equa-
tion 3 thus defines the decoding of named entities via the
state sequence $S$, which (again) provides a parse of the
word sequence into phrases.
Assuming first that $E$ and $H$ encode all the information
from $A$ about $S$, and then that the specific value $h_t$ occur-
ring at an error does not provide additional information for
the NE states¹ $S$, we can rewrite Equation 3 as:

$$S^* = \arg\max_S \sum_E P(E \mid H, A)\, P(S \mid E, H, A)$$
$$= \arg\max_S \sum_E P(E \mid H, A)\, P(S \mid E, H)$$
$$= \arg\max_S \sum_E P(E \mid H, A)\, P(S \mid W_{(E,H)}),$$

where $W_{(E,H)}$ denotes the word sequence obtained from $H$
by substituting the $\epsilon$-token for each word that $E$ marks as
an error.
For the error model, $P(E \mid H, A)$, we assume that er-
rors are conditionally independent given the hypothesized
word sequence $H$ and the evidence $A$:

$$P(E \mid H, A) = \prod_{t=1}^{N} P(e_t \mid H, A), \qquad (4)$$

where $c_t = P(e_t = 0 \mid H, A)$ is the ASR word “confi-
dence”. Of course, the errors are not independent, a fact
we take advantage of in our post-processing of confidence
estimates, described in Section 3.
We can find $P(S \mid W)$ directly from the information ex-
traction model $P(S, W)$ described in Section 2.1, but there
is no efficient decoding algorithm. Hence we approximate

$$P(S \mid W) = \frac{P(S, W)}{P(W)} \approx \tilde{P}(S, W) \qquad (5)$$

assuming that the different words that could lead to an er-
ror are roughly uniform over the likely set. More specifi-
cally, $\tilde{P}(S, W)$ incorporates a scaling term as follows:

$$\tilde{P}(\epsilon \mid w_{t-1} = v, s_t) = \frac{1}{n_v}\, P(\epsilon \mid w_{t-1} = v, s_t) \qquad (6)$$

where $n_v$ is the number of different error words observed
after $v$ in the training set and $P(\epsilon \mid v, s_t)$ is trained by col-
lapsing all different errors into a single label $\epsilon$. Training
this language model requires data that contains $\epsilon$-tokens,
which can be obtained by aligning the reference data and
the ASR output. In fact, we train the language model with
a combination of the original reference data and a dupli-
cate version with $\epsilon$-tokens replacing error words.
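A minimal sketch of how such $\epsilon$-annotated training text could be produced, assuming the alignment step has already flagged which reference positions were misrecognized (the `make_training_texts` helper and the `<eps>` label are our own notation):

```python
# Sketch of building epsilon-token training data: keep the original
# reference text, plus a duplicate in which words the recognizer got
# wrong are replaced by a single "<eps>" label. The reference/ASR
# alignment is simplified here to a per-position error flag.

EPS = "<eps>"

def make_training_texts(reference, error_flags):
    """Return [original reference, duplicate with errors as EPS]."""
    dup = [EPS if err else w for w, err in zip(reference, error_flags)]
    return [reference, dup]

texts = make_training_texts(["the", "president", "clinton"], [0, 0, 1])
```

Here `texts[1]` is `["the", "president", "<eps>"]`, so the language model sees both the true word context and the collapsed error label.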
Because of the conditional independence assumptions
behind Equations 1 and 4, there is an efficient algorithm
for solving Equation 3, which combines steps similar to
the forward and Viterbi algorithms used with HMMs. The
search is linear in the length $N$ of the hypothesized
word sequence and the size of the state space (the product
space of NE states and error states). The forward compo-
nent is over the error state (parallel branches in the lattice),
and the Viterbi component is over the NE states.
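The combined search can be sketched as follows, under simplifying assumptions: state-dependent word probabilities without the full bigram context, toy probability tables, and a uniform initial state distribution. At each position the correct-word and $\epsilon$ branches are summed (forward step), while the NE state is maximized (Viterbi step):

```python
# Sketch of Viterbi decoding over NE states with the error indicator
# marginalized out at each lattice position (Equation 3). Bigram context
# and smoothing are simplified; all tables below are toy values.

def decode(hyp_words, confidences, states, word_lm, eps_lm, trans):
    delta = {s: 1.0 for s in states}        # uniform initial distribution
    back = []
    for w, c in zip(hyp_words, confidences):
        new_delta, ptr = {}, {}
        for s in states:
            # Forward step over the error arc: correct branch + epsilon branch.
            emit = c * word_lm.get((w, s), 1e-6) + (1 - c) * eps_lm.get(s, 1e-6)
            # Viterbi step over NE states.
            best_prev = max(states, key=lambda p: delta[p] * trans.get((p, s), 1e-6))
            new_delta[s] = delta[best_prev] * trans.get((best_prev, s), 1e-6) * emit
            ptr[s] = best_prev
        delta, back = new_delta, back + [ptr]
    # Trace back the best NE state sequence.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["NAME", "OTHER"]
word_lm = {("clinton", "NAME"): 0.5, ("said", "OTHER"): 0.3}
eps_lm = {"NAME": 0.2, "OTHER": 0.05}
trans = {("NAME", "NAME"): 0.5, ("NAME", "OTHER"): 0.5,
         ("OTHER", "NAME"): 0.2, ("OTHER", "OTHER"): 0.8}

path = decode(["clinton", "said"], [0.4, 0.9], states, word_lm, eps_lm, trans)
```

In this toy run the low-confidence word "clinton" is still labeled `NAME` because the $\epsilon$ branch carries name-state probability, illustrating how the error arcs keep uncertain words available to the NE model.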
If the goal is to find the words that are in error (e.g., for
subsequent correction) as well as the named entities, then
the objective is

$$(S, E)^* = \arg\max_{S, E} P(S, E \mid H, A) \qquad (7)$$
$$\approx \arg\max_{S, E} P(E \mid H, A)\, \tilde{P}(S, W_{(E,H)}), \qquad (8)$$

which simply involves finding the best path $E^*$ through
the lattice in Figure 1. Again because of the conditional
independence assumption, an efficient solution involves
Viterbi decoding over an expanded state space (the prod-
uct of the names and errors). The sequence $E^*$ can help
us define a new word sequence $\hat{W}$ that contains $\epsilon$-tokens:
$\hat{w}_t = h_t$ if $e_t^* = 0$, and $\hat{w}_t = \epsilon$ if $e_t^* = 1$. Joint error
and named entity decoding results in a small degradation
in named entity recognition performance, since only a sin-
gle error path is used. Since errors are not used explicitly
in this work, all results are based on the objective given by
Equation 3.

Note that, unlike work that uses confidence scores $c_t$
as weights for the hypothesized words in information re-
trieval [5], here the confidence scores also provide weights
$(1 - c_t)$ for explicit (but unspecified) sets of alternative
hypotheses.

¹Clearly, some hypotheses do provide information about
$S$, in that a reasonably large number of errors involve sim-
ple ending differences. However, our current system has
no mechanism for taking advantage of this information ex-
plicitly, which would likely add substantially to the com-
plexity of the model.
2.3 Multiple Error Types
Though the model described above uses a single error
token $\epsilon$ and a 2-category word confidence score (correct
word vs. error), it is easily extensible to multiple classes
of errors simply by expanding the error state space. More
specifically, we add multiple parallel arcs in the lattice in
Figure 1, labeled $\epsilon_1$, $\epsilon_2$, etc., and modify confidence esti-
mation to predict multiple categories of errors.
In this work, we focus particularly on distinguishing
out-of-vocabulary (OOV) errors from in-vocabulary (IV)
errors, due to the large percentage of OOV words that are
names (57% of OOVs occur in named entities). Looking
at the data another way, the percentage of name words that
are OOV is an order of magnitude larger than words in
the “other” phrase category, as described in more detail
in [6]. As it turns out, since OOVs are so infrequent, it
is difficult to robustly estimate the probability of IV vs.
OOV errors from standard acoustic features, and we sim-
ply use the relative prior probabilities to scale the single
error probability.
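The prior-based split of the error probability can be sketched as below; the OOV prior used here is an illustrative value, not one estimated from the corpus:

```python
# Sketch of splitting the single error probability (1 - c_t) into IV and
# OOV error arcs using relative prior probabilities, since OOV errors are
# too infrequent to model robustly from acoustic features alone.

P_OOV_GIVEN_ERR = 0.2   # hypothetical prior: fraction of errors due to OOVs

def split_error_prob(confidence):
    p_err = 1.0 - confidence
    return {"iv_err": p_err * (1 - P_OOV_GIVEN_ERR),
            "oov_err": p_err * P_OOV_GIVEN_ERR}

arcs = split_error_prob(0.6)   # total error mass 0.4, split 0.32 / 0.08
```

The two resulting masses label the parallel $\epsilon_1$ and $\epsilon_2$ arcs; their sum always equals the original single-arc error probability.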
3. CONFIDENCE PREDICTION
An essential component of our error model is the word-level
confidence score $P(e_t \mid H, A)$, so one would expect
that better confidence scores would result in better error
modeling performance. Hence, we investigated methods
for improving the confidence estimates, focusing specifi-
cally on introducing new features that might complement
the features used to provide the baseline confidence esti-
mates. The baseline confidence scores used in this study
were provided by Dragon Systems. As described in [7],
the Dragon confidence predictor used a generalized lin-
ear model with six inputs: the word duration, the lan-
guage model score, the fraction of times the word appears
in the top 100 hypotheses, the average number of active
HMM states in decoding for the word, a normalized acous-
tic score and the log of the number of recognized words
in the utterance. We investigated several new features, of
which the most useful are listed below.
First, we use a short window of the original confidence
scores: $c_t$, $c_{t-1}$, and $c_{t+1}$. Note that the post-processing
paradigm allows us to use non-causal features such as $c_{t+1}$.
We also define three features based on the ratios of $c_{t-1}$,
$c_t$, and $c_{t+1}$ to the average confidence for the document
in which $h_t$ appears, under the assumption that a low con-
fidence score for a word is less likely to indicate a word
error if the average confidence for the entire document
is also low. We hypothesized that words occurring fre-
quently in a large window would be more likely to be cor-
rect, again assuming that the ASR system would make er-
rors randomly from a set of possibilities. Therefore, we
define features based on how many times the hypothesis
word $h_t$ occurs in a window $(h_{t-k}, \ldots, h_t, \ldots, h_{t+k})$ for
$k = 5$, 10, 25, 50, and 100 words. Finally, we also use the
relative frequency of words occurring as an error in the
training corpus, again looking at a window of $\pm 1$ around
the current word.
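A sketch of these window-based features, with our own (hypothetical) feature names and simple boundary handling at the edges of the document:

```python
# Sketch of the confidence post-processing features: neighboring scores,
# ratios to the document-average confidence, and repetition counts in
# windows of several sizes. In the real system these feed a decision tree.

def window_features(words, conf, t, ks=(5, 10, 25, 50, 100)):
    doc_avg = sum(conf) / len(conf)
    f = {"c": conf[t],
         "c_prev": conf[t - 1] if t > 0 else 1.0,       # edge default
         "c_next": conf[t + 1] if t + 1 < len(conf) else 1.0}
    for name in ("c", "c_prev", "c_next"):
        f[name + "_ratio"] = f[name] / doc_avg          # ratio to doc average
    for k in ks:
        lo, hi = max(0, t - k), min(len(words), t + k + 1)
        f["count_%d" % k] = words[lo:hi].count(words[t])  # repetition count
    return f

feats = window_features(["a", "b", "a", "c", "a"],
                        [0.5, 0.9, 0.5, 0.7, 0.4], t=2)
```

The non-causal `c_next` feature is possible only because this runs as post-processing over the full recognizer output.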
Due to the close correlation between names and errors,
we would expect to see improvement in the error mod-
eling performance by including information about which
words are names, as determined by the NE system. There-
fore, in addition to the above set of features, we define a
new feature: whether the hypothesis word $h_t$ is part of a
location, organization, or person phrase. We can deter-
mine the value of this feature directly from the output of
the NE system. Given this additional feature, we can de-
fine a multi-pass processing cycle consisting of two steps:
confidence re-estimation and information extraction. To
obtain the name information for the first pass, the confi-
dence scores are re-estimated without using the name fea-
tures, and these confidences are used in a joint NE and
error decoding system. The resulting name information is
then used, in addition to all the features used in the previ-
ous pass, to improve the word confidence estimates. The
improved confidences are in turn used to further improve
the performance of the NE system.
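The multi-pass cycle can be schematized as follows. The predictors here are hypothetical stand-ins (a confidence discount for name words, a threshold "NE system"); a real system would use the trained decision tree and the NE decoder of Section 2:

```python
# Schematic of the two-step multi-pass cycle: re-estimate confidences,
# then run NE decoding, feeding each pass's name labels into the next
# pass's confidence features. Both callables below are toy stand-ins.

def multipass(words, base_conf, reestimate, run_ne, passes=2):
    conf, names = list(base_conf), [False] * len(words)
    for _ in range(passes):
        # Pass 1 has no name features (names all False); later passes do.
        conf = [reestimate(c, is_name) for c, is_name in zip(conf, names)]
        names = run_ne(words, conf)
    return conf, names

# Toy stand-ins: name words get a confidence discount (names correlate
# with errors); "NE decoding" flags low-confidence words as names.
reest = lambda c, is_name: c * (0.9 if is_name else 1.0)
ne = lambda words, conf: [c < 0.5 for c in conf]

conf, names = multipass(["x", "y"], [0.4, 0.8], reest, ne)
```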
We investigated three different methods for using the
above features in confidence estimation: decision trees,
generalized linear models, and linear interpolation of the
outputs of the decision tree and generalized linear model.
The decision trees and generalized linear models gave sim-
ilar performance, and a small gain was obtained by inter-
polating these predictions. For simplicity, the results here
use only the decision tree model.
A standard method for evaluating confidence predic-
tion [8] is the normalized cross entropy (NCE) of the bi-
nary correct/error predictors, that is, the reduction in un-
certainty in confidence prediction relative to the ASR sys-
tem error rate. Using the new features in a decision tree
predictor, the NCE score of the binary confidence predic-
tor improved from 0.195 to 0.287. As shown in the next
section, this had a significant impact on NE performance.
(See [6] for further details on these experiments and an
analysis of the relative importance of different factors.)
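A sketch of the NCE computation for a binary correct/error predictor, following the standard definition: the baseline entropy comes from the overall correct rate, and the conditional entropy from the per-word confidences.

```python
import math

# Sketch of normalized cross entropy (NCE) for binary correct/error
# confidence predictors: the relative reduction in uncertainty over a
# baseline that only knows the system's overall correct rate.

def nce(confidences, correct_flags):
    p_c = sum(correct_flags) / len(correct_flags)      # overall correct rate
    h_base = -sum(math.log2(p_c if ok else 1 - p_c) for ok in correct_flags)
    h_cond = -sum(math.log2(c if ok else 1 - c)
                  for c, ok in zip(confidences, correct_flags))
    return (h_base - h_cond) / h_base

score = nce([0.9, 0.8, 0.1, 0.7], [1, 1, 0, 1])
```

A perfect predictor approaches 1.0, a predictor no better than the prior scores 0, and a misleading one goes negative; the illustrative call above lands around 0.65.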
4. EXPERIMENTAL RESULTS
The specific information extraction task we address in
this work is the identification of name phrases (names of
persons, locations, and organizations), as well as identi-
fication of temporal and numeric expressions, in the ASR
output. Also known as named entities (NEs), these phrases
are useful in many language understanding tasks, such as
coreference resolution, sentence chunking and parsing, and
summarization/gisting.
4.1 Data and Evaluation Method
The data we used for the experiments described in this
paper consisted of 114 news broadcasts automatically an-
notated with recognition confidence scores and hand la-
beled with NE types and locations. The data represents
an intersection of the data provided by Dragon Systems
for the 1998 DARPA-sponsored Hub-4 Topic Detection
and Tracking (TDT) evaluation and those stories for which
named entity labels were available. Broadcast news data
is particularly appropriate for our work since it contains a
high density of name phrases, has a relatively high word
error rate, and requires a virtually unlimited vocabulary.
We used two versions of each news broadcast: a refer-
ence transcription prepared by a human annotator and an
ASR transcript prepared by Dragon Systems for the TDT
evaluation [7]. The Dragon ASR system had a vocabulary
size of about 57,000 words and a word error rate (WER) of
about 30%. The ASR data contained the word-level confi-
dence information, as described earlier, and the reference
transcription was manually annotated with named entity
information. By aligning the reference and ASR transcrip-
tions, we were able to determine which ASR output words
corresponded to errors and to the NE phrases.
We randomly selected 98 of the 114 broadcasts as train-
ing data, 8 broadcasts as development test, and 8 broad-
casts as evaluation test data, which were kept “blind” to
ensure unbiased evaluation results. We used the training
data to estimate all model parameters, the development
test set to tune parameters during development, and the
evaluation test set for all results reported here. For all ex-
periments we used the same training and test data.
4.2 Information Extraction Results
Table 1 shows the performance of the baseline informa-
tion extraction system (row 1) which does not model er-
rors, compared to systems using one and two error types,
with the baseline confidence estimates and the improved
confidence estimates from the previous section. Perfor-
mance figures are the standard measures used for this task:
F-measure (harmonic mean of recall and precision) and
slot error rate (SER), where separate type, extent and con-
tent error measures are averaged to get the reported result.
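The two measures can be sketched as follows; the counts in the example calls are illustrative, not taken from Table 1:

```python
# Sketch of the NE scoring measures: F-measure as the harmonic mean of
# recall and precision, and slot error rate (SER) over substitution,
# insertion, and deletion slot errors.

def f_measure(correct, hypothesized, reference):
    precision = correct / hypothesized
    recall = correct / reference
    return 2 * precision * recall / (precision + recall)

def slot_error_rate(substitutions, insertions, deletions, reference_slots):
    return (substitutions + insertions + deletions) / reference_slots

f = f_measure(correct=80, hypothesized=100, reference=120)        # P=0.8, R=2/3
ser = slot_error_rate(substitutions=10, insertions=5, deletions=5,
                      reference_slots=100)
```

Note that SER, unlike F-measure, can exceed 1.0 when insertions are numerous, which is why the two measures can rank systems differently.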
The results show that modeling errors gives a significant
improvement in performance. In addition, there is a small
but consistent gain from modeling OOV vs. IV errors sep-
arately. Further gain is provided by each improvement to
the confidence estimator.
Since the evaluation criterion involves a weighted av-
erage of content, type and extent errors, there is an upper
bound of 86.4 for the F-measure given the errors in the
recognizer output. In other words, this is the best perfor-
mance we can hope for without running additional pro-
cessing to correct the ASR errors. Thus, the combined
error modeling improvements lead to recovery of 28% of
the possible performance gains from this scheme. It is also
interesting to note that the improvement in identifying the
extent of a named entity actually results in a decrease in
performance of the content component, since words that
are incorrectly recognized are introduced into the named
entity regions.
5. DISCUSSION
In this paper we described our use of error modeling
to improve information extraction from speech data. Our
model is the first to explicitly represent the uncertainty
inherent in the ASR output word sequence.

Table 1: Named entity (NE) recognition results using dif-
ferent error models and feature sets for predicting confi-
dence scores. The baseline confidence scores are from the
Dragon recognizer, the secondary processing re-estimates
confidences as a function of a window of these scores, and
the names are provided by a previous pass of named entity
detection.

ε-tokens   Confidence Scores     NE F-Measure   NE SER
none       none                  68.4           50.9
1          baseline              71.4           46.1
2          baseline              71.5           45.9
1          + secondary           71.8           44.9
2          + secondary           72.0           44.8
1          + secondary + names   73.1           44.3
2          + secondary + names   73.4           43.9

Two key innovations are the use of word confidence scores
to characterize the ASR outputs and alternative hypotheses, and
integration of the error model with a statistical model of
information extraction. In addition, improvements in per-
formance were obtained by modeling multiple types of er-
rors (in vocabulary vs. out of vocabulary) and adding new
features to the confidence estimator obtained using multi-
pass processing. The new features led to improved confi-
dence estimation from a baseline NCE of 0.195 to a value
of 0.287. The use of the error model with these improve-
ments resulted in a reduction in slot error rate of 14% and
an improvement in the F-measure from 68.4 to 73.4.
The integrated model can be used for recognition of
NEs alone, as in this work, or in joint decoding of NEs
and errors. Since ASR errors substantially degrade NE
recognition rates (perfect NE labeling with the errorful
outputs here would have an F-measure of 86.4), and since
many names are recognized in error because they are out
of the recognizer’s vocabulary, an important next step in
this research is explicit error detection and correction. Pre-
liminary work in this direction is described in [6]. In ad-
dition, while this work is based on 1-best recognition out-
puts, it is straightforward to use the same algorithm for
lattice decoding, which may also provide improved NE
recognition performance.
Acknowledgments
The authors thank Steven Wegmann of Dragon Systems
for making their ASR data available for these experiments
and BBN for preparing and releasing additional NE train-
ing data. This material is based in part upon work sup-
ported by the National Science Foundation under Grant
No. IIS0095940. Any opinions, findings, and conclusions
or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the views of
the National Science Foundation.
6. REFERENCES
[1] R. Grishman, “Information extraction and speech
recognition,” Proceedings of the Broadcast News
Transcription and Understanding Workshop, pp.
159–165, 1998.
[2] D. Palmer, M. Ostendorf, and J. Burger, “Robust
Information Extraction from Automatically
Generated Speech Transcriptions,” Speech
Communication, vol. 32, pp. 95–109, 2000.
[3] D. Bikel, R. Schwartz, R. Weischedel, “An
Algorithm that Learns What’s in a Name,” Machine
Learning, 34(1/3):211–231, 1999.
[4] Y. Gotoh, S. Renals, “Information Extraction From
Broadcast News,” Philosophical Transactions of the
Royal Society, series A: Mathematical, Physical and
Engineering Sciences, 358(1769):1295–1308, 2000.
[5] A. Hauptmann, R. Jones, K. Seymore, S. Slattery,
M. Witbrock, and M. Siegler, “Experiments in
information retrieval from spoken documents,”
Proceedings of the Broadcast News Transcription
and Understanding Workshop, pp. 175–181, 1998.
[6] D. Palmer, Modeling Uncertainty for Information
Extraction from Speech Data, Ph.D. dissertation,
University of Washington, 2001.
[7] L. Gillick, Y. Ito, L. Manganaro, M. Newman, F.
Scattone, S. Wegmann, J. Yamron, and P. Zhan,
“Dragon Systems’ Automatic Transcription of New
TDT Corpus,” Proceedings of the Broadcast News
Transcription and Understanding Workshop, pp.
219–221, 1998.
[8] M. Siu and H. Gish, “Evaluation of word confidence
for speech recognition systems,” Computer Speech
& Language, 13(4):299–319, 1999.
