<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4018">
  <Title>Improving Automatic Sentence Boundary Detection with Confusion Networks</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Tasks &amp; Baseline
</SectionTitle>
    <Paragraph position="0"> This work specifically detects boundaries of sentence-like units called SUs. An SU roughly corresponds to a sentence, except that SUs are for the most part defined as units that include only one independent main clause, and they may sometimes be incomplete as when a speaker is interrupted and does not complete their sentence. A more specific annotation guideline for SUs is available (Strassel, 2003), which we refer to as the &amp;quot;V5&amp;quot; standard. In this work, we focus only on detecting SUs and do not differentiate among the different types (e.g. statement, question, etc.) that were used for annotation. We work with a relatively new corpus and set of evaluation tools, which are described below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Corpora
</SectionTitle>
      <Paragraph position="0"> The system is evaluated for both conversational telephone speech (CTS) and broadcast news (BN), in both cases using training, development and test data annotated according to the V5 standard. The test data is that used in the DARPA Rich Transcription (RT) Fall 2003 evaluations; the development and evaluation test sets together comprise the Spring 2003 RT evaluation test sets.</Paragraph>
      <Paragraph position="1"> For CTS, there are 40 hours of conversations available for training from the Switchboard corpus, and 3 hours (72 conversation sides) each of development and evaluation test data drawn from both the Switchboard and Fisher corpora. The development and evaluation set each have roughly 6000 SUs.</Paragraph>
      <Paragraph position="2"> The BN data consists of a set of 20 hours of news shows for training, and 3 hours (6 shows) for testing. The development and evaluation test data contains 1.5 hours (3 shows) each for development and evaluation, each with roughly 1000 SUs. Test data comes from the month of February in 2001; training data is taken from a previous time period.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Baseline System
</SectionTitle>
      <Paragraph position="0"> The automatic speech recognition systems used were updated versions of those used by SRI in the Spring 2003 RT evaluations (NIST, 2003), with a WER of 12.1% on BN data and 22.9% on CTS data. Both systems perform multiple recognition and adaptation passes, and eventually produce up to 2000-best hypotheses per waveform segment, which are then rescored with a number of knowledge sources, such as higher-order language models, pronunciation scores, and duration models (for CTS). For best results, the systems combine decoding output from multiple front ends, each producing a separate N-best list. All N-best lists for the same waveform segment are then combined into a single word confusion network (Mangu et al., 2000) from which the hypothesis with lowest expected word error is extracted. In our baseline SU system, the single best word stream thus obtained is then used as the basis for SU recognition.</Paragraph>
      <Paragraph position="1"> Our baseline SU system builds on previous work on sentence boundary detection using lexical and prosodic features (Shriberg et al., 2000). The system takes as input alignments from either reference or recognized (1best) words, and combines lexical and prosodic information using an HMM. Prosodic features include about 100 features reflecting pause, duration, F0, energy, and speaker change information. The prosody model is a decision tree classifier that generates the posterior probability of an SU boundary at each interword boundary given the prosodic features. Trees are trained from sampled training data in order to make the model sensitive to features of the minority SU class. Recent prosody model improvements include the use of bagging techniques in decision tree training to reduce the variability due to a single tree (Liu et al., 2003). Language model improvements include adding information from a POS-based model, a model using automatically-induced word classes, and a model trained on separate data.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> Errors are measured by a slot error rate similar to the WER metric utilized by the speech recognition community, i.e. dividing the total number of inserted and deleted SUs by the total number of reference SUs. (There are no substitution errors because there is only one sentence class.) When recognition output is used, the words will generally not align perfectly with the reference transcription and hence the SU boundary predictions will require some alignment procedure to match to the reference location. Here, the alignment is based on the minimum word error alignment of the reference and hypothesized word strings, and the minimum SU error alignment if the WER is equal for multiple alignments. We report numbers computed with the su-eval scoring tool from NIST.</Paragraph>
      <Paragraph position="1"> SU error rates for the reference words condition of our baseline system are 49.04% for BN, and 30.13% for CTS, as reported at the NIST RT03F evaluation (Liu et al., 2003). Results for the automatic speech recognition condition are described in Section 5.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Using N-Best Sentence Hypotheses
</SectionTitle>
    <Paragraph position="0"> The large increase in SU detection error rate in moving from reference to recognizer transcripts motivates an approach that reduces the mistakes introduced by word recognition errors. Although the best recognizer output is optimized to reduce word error rate, alternative hypotheses may together reinforce alternative (more accurate) SU predictions. The oracle WER for the confusion networks is much lower than for the single best hypothesis, in the range of 13-16% WER for the CTS test sets.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Feature Extraction and SU Detection
</SectionTitle>
      <Paragraph position="0"> Prediction of SUs using multiple hypotheses requires prosodic feature extraction for each hypothesis, which in turn requires a forced alignment of each hypothesis.</Paragraph>
      <Paragraph position="1"> Thousands of hypotheses are output by the recognizer, but we prune to a smaller set to reduce the cost of running forced alignments and prosodic feature extraction.</Paragraph>
      <Paragraph position="2"> The recognizer outputs an N-best list of hypotheses and assigns a posterior probability to each hypothesis, which is normalized to sum to 1 over all hypotheses. We collect hypotheses from the N-best list for each acoustic segment up to 90% of the posterior mass (or to a maximum count of 1000).</Paragraph>
      <Paragraph position="3"> Next, forced alignment and prosodic feature extraction are run for all segments in this pruned set of hypotheses. Statistics for prosodic feature normalization (such as speaker and turn F0 mean) are collected from the single best hypothesis. After obtaining the prosodic features, the HMM predicts sentence boundaries for each word sequence hypothesis independently. For each hypothesis, an SU prediction is made at all word boundaries, resulting in a posterior probability for SU and no SU at each boundary. The same models are used as in the 1-best predictions - no parameters were re-optimized for the N-best framework. Given independent predictions for the individual hypotheses, we then build a system to incorporate the multiple predictions into a single hypothesis, as described next.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Combining Hypotheses
</SectionTitle>
      <Paragraph position="0"> The prediction results for an individual hypothesis are represented in a confusion network that consists of a series of word slots, each followed by a slot with SU and no SU, as shown in Figure 1 with hypothetical confidences for the between-word events. (This representation is a somewhat unusual form because the word slots have only a single hypothesis.) The words in the individual hypotheses have probability one, and each arc with an SU or no SU token has a confidence (posterior probability) assigned from the HMM. The overall network has a score associated with its N-best hypothesis-level posterior probability, scaled by a weight corresponding to the goodness of the system that generated that hypothesis.</Paragraph>
      <Paragraph position="1">  The confusion networks for each hypothesis are then merged with the SRI Language Modeling Toolkit (Stolcke, 2002) to create a single confusion network for an overall hypothesis. This confusion network is derived from an alignment of the confusion networks of each individual hypothesis. The resulting network contains slots with the word hypotheses from the N-best list and slots with the combined SU/no SU probability, as shown in Figure 2. The confidences assigned to each token in the new confusion network are a weighted linear combination of the probabilities from individual hypotheses that align to each other, compiled from the entire hypothesis list, where the weights are the hypothesis-level scores from the recognizer.</Paragraph>
      <Paragraph position="2">  Finally, the best decision at each point is selected by choosing the words and boundaries with the highest probability. Here, the words and SUs are selected independently, so that we obtain the same words as would be selected without inserting the SU tokens and guarantee no degradation in WER. The key improvement is that the SU detection is now a result of detection across all recognizer hypotheses, which reduces the effect of word errors in the top hypothesis.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>