<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1034">
  <Title>Improving Information Extraction by Modeling Errors in Speech Recognizer Output</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> The specific information extraction task we address in this work is the identification of name phrases (names of persons, locations, and organizations), as well as identification of temporal and numeric expressions, in the ASR output. Also known as named entities (NEs), these phrases are useful in many language understanding tasks, such as coreference resolution, sentence chunking and parsing, and summarization/gisting.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data and Evaluation Method
</SectionTitle>
      <Paragraph position="0"> The data we used for the experiments described in this paper consisted of 114 news broadcasts automatically annotated with recognition confidence scores and hand labeled with NE types and locations. The data represents an intersection of the data provided by Dragon Systems for the 1998 DARPA-sponsored Hub-4 Topic, Detection and Tracking (TDT) evaluation and those stories for which named entity labels were available. Broadcast news data is particularly appropriate for our work since it contains a high density of name phrases, has a relatively high word error rate, and requires a virtually unlimited vocabulary.</Paragraph>
      <Paragraph position="1"> We used two versions of each news broadcast: a reference transcription prepared by a human annotator and an ASR transcript prepared by Dragon Systems for the TDT evaluation [7]. The Dragon ASR system had a vocabulary size of about 57,000 words and a word error rate (WER) of about 30%. The ASR data contained the word-level confidence information, as described earlier, and the reference transcription was manually-annotated with named entity information. By aligning the reference and ASR transcriptions, we were able to determine which ASR output words corresponded to errors and to the NE phrases.</Paragraph>
      <Paragraph position="2"> We randomly selected 98 of the 114 broadcasts as training data, 8 broadcasts as development test, and 8 broadcasts as evaluation test data, which were kept &amp;quot;blind&amp;quot; to ensure unbiased evaluation results. We used the training data to estimate all model parameters, the development test set to tune parameters during development, and the evaluation test set for all results reported here. For all experiments we used the same training and test data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Information Extraction Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the performance of the baseline information extraction system (row 1) which does not model errors, compared to systems using one and two error types, with the baseline confidence estimates and the improved confidence estimates from the previous section. Performance figures are the standard measures used for this task: F-measure (harmonic mean of recall and precision) and slot error rate (SER), where separate type, extent and content error measures are averaged to get the reported result.</Paragraph>
      <Paragraph position="1"> The results show that modeling errors gives a significant improvement in performance. In addition, there is a small but consistent gain from modeling OOV vs. IV errors separately. Further gain is provided by each improvement to the confidence estimator.</Paragraph>
      <Paragraph position="2"> Since the evaluation criterion involves a weighted average of content, type and extent errors, there is an upper bound of 86.4 for the F-measure given the errors in the recognizer output. In other words, this is the best performance we can hope for without running additional processing to correct the ASR errors. Thus, the combined error modeling improvements lead to recovery of 28% of the possible performance gains from this scheme. It is also interesting to note that the improvement in identifying the extent of a named entity actually results in a decrease in performance of the content component, since words that are incorrectly recognized are introduced into the named entity regions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>