<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1034">
  <Title>Improving Information Extraction by Modeling Errors in Speech Recognizer Output</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. APPROACH
</SectionTitle>
    <Paragraph position="0"> Our approach to error handling in information extraction involves using probabilistic models for both information extraction and the ASR error process. The component models and an integrated search strategy are described in this section.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Statistical IE
</SectionTitle>
      <Paragraph position="0"> We use a probabilistic IE system that relates a word sequence a3a5a4a7a6a9a8a11a10a13a12a2a12a13a12a14a10a15a6a17a16 to a sequence of information states a18a19a4a21a20 a8 a10a13a12a13a12a13a12a14a10a22a20 a16 that provide a simple parse of the word sequence into phrases, such as name phrases. For the work described here, the states a20a24a23 correspond to different types of NEs. The IE model is essentially a phrase language model:</Paragraph>
      <Paragraph position="2"> with state-dependent bigrams a25a27a26 a6 a23 a38 a6 a23a41a40 a8a2a10a33a20 a23 a28 that model the types of words associated with a specific type of NE, and state transition probabilitiesa25a27a26 a20a24a23a22a38 a20a2a23a41a40 a8 a10a32a6a37a23a41a40 a8 a28 that mix the Markov-like structure of an HMM with dependence on the previous word. (Note that titles, such as &amp;quot;President&amp;quot; and &amp;quot;Mr.&amp;quot;, are good indicators of transition to a name state.) This IE model, described further in [2], is similar to other statistical approaches [3, 4] in the use of state dependent bigrams, but uses a different smoothing mechanism and state topology. In addition, a key difference in our work is explicit error modeling in the &amp;quot;word&amp;quot; sequence, as described next.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Error Modeling
</SectionTitle>
      <Paragraph position="0"> To explicitly model errors in the IE system, we introduce new notation for the hypothesized word sequence,  a4a46a45a47a8a2a10a13a12a13a12a13a12a14a10a33a45a48a16 , which may differ from the actual word sequence a3 , and a sequence of error indicator variables  a4a51a50a52a8a13a10a2a12a11a12a13a12a47a10a15a50a53a16 , where a50 a23 a4a55a54 when a45 a23 is an error and a50a56a23a17a4a58a57 when a45a59a23 is correct. We assume that the hypothesized words from the recognizer are each annotated with confidence scores</Paragraph>
      <Paragraph position="2"> where a62 represents the set of features available for initial confidence estimation from the recognizer, acoustic or otherwise.</Paragraph>
      <Paragraph position="4"> We construct a simple lattice from a45 a8 a10a2a12a13a12a13a12a14a10a33a45 a16 with &amp;quot;error&amp;quot; arcs indicated by a67 -tokens in parallel with each hypothesized word a45 a23 , as illustrated in Figure 1. We then find the maximum posterior probability state sequence by summing over all paths through the lattice:</Paragraph>
      <Paragraph position="6"> or, equivalently, marginalizing over the sequence a49 . Equation 3 thus defines the decoding of named entities via the state sequence a18 , which (again) provides a parse of the word sequence into phrases.</Paragraph>
      <Paragraph position="7"> Assuming first that a49 and a44 encode all the information from a62 about a18 , and then that the specific value a45a48a23 occurring at an error does not provide additional information for the NE states1 a18 , we can rewrite Equation 3 as:</Paragraph>
      <Paragraph position="9"> For the error model, a25a31a26 a49 a38  a10a33a62a66a28 , we assume that errors are conditionally independent given the hypothesized word sequence a44 and the evidence a62 :</Paragraph>
      <Paragraph position="11"> where a60 a23 a4 a25a31a26 a50 a23 a4a89a57a90a38 a44 a10a33a62a66a28 is the ASR word &amp;quot;confidence&amp;quot;. Of course, the errors are not independent, which we take advantage of in our post-processing of confidence estimates, described in Section 3.</Paragraph>
      <Paragraph position="12"> We can find a25a27a26 a18a80a38 a3a29a28 directly from the information extraction model, a25a27a26 a18a27a10a33a3a29a28 described in Section 2.1, but there is no efficient decoding algorithm. Hence we approximate</Paragraph>
      <Paragraph position="14"> assuming that the different words that could lead to an error are roughly uniform over the likely set. More specifi-</Paragraph>
      <Paragraph position="16"> where a95a98a96 is the number of different error words observed after a94 in the training set and a25a27a26 a67a24a38 a94a59a10a33a20a24a23a42a28 is trained by collapsing all different errors into a single label a67 . Training this language model requires data that contains a67 -tokens, which can be obtained by aligning the reference data and the ASR output. In fact, we train the language model with a combination of the original reference data and a duplicate version with a67 -tokens replacing error words.</Paragraph>
      <Paragraph position="17"> Because of the conditional independence assumptions behind equations 1 and 4, there is an efficient algorithm for solving equation 3, which combines steps similar to the forward and Viterbi algorithms used with HMMs. The search is linear with the length a99 of the hypothesized word sequence and the size of the state space (the product space of NE states and error states). The forward component is over the error state (parallel branches in the lattice), and the Viterbi component is over the NE states.</Paragraph>
      <Paragraph position="18"> If the goal is to find the words that are in error (e.g. for subsequent correction) as well as the named entities, then the objective is</Paragraph>
      <Paragraph position="20"> a8 Clearly, some hypotheses do provide information about a18 in that a reasonably large number of errors involve simple ending differences. However, our current system has no mechanism for taking advantage of this information explicitly, which would likely add substantially to the complexity of the model.</Paragraph>
      <Paragraph position="21"> which simply involves finding the best path a49 a68 through the lattice in Figure 1. Again because of the conditional independence assumption, an efficient solution involves Viterbi decoding over an expanded state space (the product of the names and errors). The sequence a49 a68 can help us define a new word sequence a102a3 that contains a67 -tokens:</Paragraph>
      <Paragraph position="23"> and named entity decoding results in a small degradation in named entity recognition performance, since only a single error path is used. Since errors are not used explicitly in this work, all results are based on the objective given by equation 3.</Paragraph>
      <Paragraph position="24"> Note that, unlike work that uses confidence scores a60 a23 as a weight for the hypothesized word in information retrieval [5], here the confidence scores also provide weights</Paragraph>
      <Paragraph position="26"> a23 a28 for explicit (but unspecified) sets of alternative hypotheses.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Multiple Error Types
</SectionTitle>
      <Paragraph position="0"> Though the model described above uses a single error token a67 and a 2-category word confidence score (correct word vs. error), it is easily extensible to multiple classes of errors simply by expanding the error state space. More specifically, we add multiple parallel arcs in the lattice in Figure 1, labeled a67 a8 , a67a22a106 , etc., and modify confidence estimation to predict multiple categories of errors.</Paragraph>
      <Paragraph position="1"> In this work, we focus particularly on distinguishing out-of-vocabulary (OOV) errors from in-vocabulary (IV) errors, due to the large percentage of OOV words that are names (57% of OOVs occur in named entities). Looking at the data another way, the percentage of name words that are OOV is an order of magnitude larger than words in the &amp;quot;other&amp;quot; phrase category, as described in more detail in [6]. As it turns out, since OOVs are so infrequent, it is difficult to robustly estimate the probability of IV vs.</Paragraph>
      <Paragraph position="2"> OOV errors from standard acoustic features, and we simply use the relative prior probabilities to scale the single error probability.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. CONFIDENCE PREDICTION
</SectionTitle>
    <Paragraph position="0"> An essential component of our error model is the word-level confidence score, a25a31a26 a50a53a23a22a38 a44 a10a33a62a66a28 , so one would expect that better confidence scores would result in better error modeling performance. Hence, we investigated methods for improving the confidence estimates, focusing specifically on introducing new features that might complement the features used to provide the baseline confidence estimates. The baseline confidence scores used in this study were provided by Dragon Systems. As described in [7], the Dragon confidence predictor used a generalized linear model with six inputs: the word duration, the language model score, the fraction of times the word appears in the top 100 hypotheses, the average number of active HMM states in decoding for the word, a normalized acoustic score and the log of the number of recognized words in the utterance. We investigated several new features, of which the most useful are listed below.</Paragraph>
    <Paragraph position="1"> First, we use a short window of the original confidence scores: a60 a23 , a60 a23a43a40 a8 and a60 a23a88a107 a8 . Note that the post-processing paradigm allows us to use non-causal features such as a60 a23a88a107 a8 . We also define three features based on the ratios of a60 a23a41a40 a8 , a60 a23 , and a60 a23a36a107 a8 to the average confidence for the document in which a45 a23 appears, under the assumption that a low confidence score for a word is less likely to indicate a word error if the average confidence for the entire document is also low. We hypothesized that words occurring frequently in a large window would be more likely to be correct, again assuming that the ASR system would make errors randomly from a set of possibilities. Therefore, we define features based on how many times the hypothesis word a45a59a23 occurs in a window a26 a45a59a23a41a40a59a108a109a10a2a12a110a12a110a12a110a10a22a45a59a23a22a10a11a12a111a12a110a12a110a10a22a45a59a23a88a107a109a108a90a28 for a112 a4a114a113 , 10, 25, 50, and 100 words. Finally, we also use the relative frequency of words occurring as an error in the training corpus, again looking at a window of a115a116a54 around the current word.</Paragraph>
    <Paragraph position="2"> Due to the close correlation between names and errors, we would expect to see improvement in the error modeling performance by including information about which words are names, as determined by the NE system. Therefore, in addition to the above set of features, we define a new feature: whether the hypothesis word a45 a23 is part of a location, organization, or person phrase. We can determine the value of this feature directly from the output of the NE system. Given this additional feature, we can define a multi-pass processing cycle consisting of two steps: confidence re-estimation and information extraction. To obtain the name information for the first pass, the confidence scores are re-estimated without using the name features, and these confidences are used in a joint NE and error decoding system. The resulting name information is then used, in addition to all the features used in the previous pass, to improve the word confidence estimates. The improved confidences are in turn used to further improve the performance of the NE system.</Paragraph>
    <Paragraph position="3"> We investigated three different methods for using the above features in confidence estimation: decision trees, generalized linear models, and linear interpolation of the outputs of the decision tree and generalized linear model.</Paragraph>
    <Paragraph position="4"> The decision trees and generalized linear models gave similar performance, and a small gain was obtained by interpolating these predictions. For simplicity, the results here use only the decision tree model.</Paragraph>
    <Paragraph position="5"> A standard method for evaluating confidence prediction [8] is the normalized cross entropy (NCE) of the binary correct/error predictors, that is, the reduction in uncertainty in confidence prediction relative to the ASR system error rate. Using the new features in a decision tree predictor, the NCE score of the binary confidence predictor improved from 0.195 to 0.287. As shown in the next section, this had a significant impact on NE performance.</Paragraph>
    <Paragraph position="6"> (See [6] for further details on these experiments and an analysis of the relative importance of different factors.)</Paragraph>
  </Section>
class="xml-element"></Paper>