<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1018">
  <Title>Detecting Structural Metadata with Decision Trees and Transformation-Based Learning</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> For training our system and its components, we used two different subsets of Switchboard, a corpus of conversational telephone speech (CTS) (Godfrey et al., 1992).</Paragraph>
      <Paragraph position="1"> One of the data sets included 417 conversations (LDC1.3) that were hand-annotated by the Linguistic Data Consortium for dis uencies and SUs according to the V5 guidelines detailed in (Strassel, 2003). Another set of 1086 conversations from the Switchboard corpus was annotated according to (Meteer et al., 1995) and is available as part of the Treebank3 corpus (TB3). We used a version of this set that contained annotations machine-mapped to approximate the V5 annotation speci cation.</Paragraph>
      <Paragraph position="2"> For development and testing of our system, we used hand transcripts and STT system output for 72 conversations from Switchboard and the Fisher corpus, a recent CTS data collection. Half of these conversations were held out and used as development data (dev set), and the other 36 conversations were used as test data (eval set).</Paragraph>
      <Paragraph position="3"> The STT output, used only in testing, was from a state-of-the-art large vocabulary conversational speech recognizer developed by BBN. The word error rates for the STT output were 27% on the dev set and 25% on the eval set.</Paragraph>
      <Paragraph position="4"> To assess the performance of our overall system, disuencies and boundary events were predicted and then evaluated by the scoring tools developed for the NIST Rich Transcript evaluation task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Boundary Event Prediction
</SectionTitle>
      <Paragraph position="0"> Decision trees to predict boundary events were trained and tested using the IND system developed by NASA (Buntine and Caruan, 1991). All decision trees were pruned by ten-fold cross validation. The LDC1.3 set2 with reference transcriptions was used to train the trees3 and the dev set was used to evaluate their performances. null Several decision trees with different combinations of feature groups were trained to assess the usefulness of different knowledge sources for boundary event detection. The tree was then used to predict the boundary events on the reference transcription of the dev set. The results are presented in Table 4. The inclusion of a special token for fragments resulted in improved precision and recall for SUs and IPs but, surprisingly, degraded performance for ISUs. These results show that prosodic fea- null they lead to performance gains when combined with lexical cues. Examination of the decision tree trained with only the prosodic features revealed that pause duration and turn information features were placed near the top of the tree.</Paragraph>
      <Paragraph position="1"> Use of lexical features brought substantial performance improvement in all aspects, and classi cation accuracy increased when features extracted from different knowledge sources were combined. However, we observed that a smaller number of prosodic features ended up being used in the tree and they were placed at or near leaf nodes as more lexical features were made available for training. The importance of prosodic features is likely to be much more apparent for STT data. The word errors prevalent in the STT transcriptions will affect lexical features far more severely than prosodic features, and therefore the prosodic features contribute to the robustness of the overall system when lexical features become less reliable. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Edit and Filler Detection
</SectionTitle>
      <Paragraph position="0"> After the prediction of boundary events, the rules learned by the TBL system described in section 4.3 were applied to detect llers and edit regions. As with the decision trees, we trained rules using the LDC1.3 data alone, and combined with the mapped TB3 data, nding that the combined dataset gave better results for TBL training.</Paragraph>
      <Paragraph position="1"> Again we used only reference word transcripts but discovered that training with SUs and IPs predicted by the rst stage of our system was more effective than using reference boundary events.</Paragraph>
      <Paragraph position="2"> It is dif cult to formally assess the effectiveness of the TBL module independently, and results for the entire system are discussed in detail in the next section. Informal inspection of the rules learned by the TBL system indicates that, not surprisingly, word match features and the presence of IPs are very important for the detection of edit regions. The most commonly used features for identifying discourse markers are the identity or POS of the</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Overall System Results
</SectionTitle>
      <Paragraph position="0"> The performance of our system was evaluated on the fall 2003 NIST Rich Transcription Evaluation test set (RT03F) using the rt-eval scoring tool (NIST, 2003), which combines ISUs and SUs in a single category, and reports results for detection of SUs, IPs, llers, and edits without differentiating subcategories of llers and edits. This tool produces a collection of results, including percentage correct, deletions, insertions, and Slot Error Rate (SER), similar to the word error rate measure used in speech recognition. SER is de ned as the number of insertions and deletions divided by the number of reference items.</Paragraph>
      <Paragraph position="1"> Note that scores are somewhat different from those in  Table 5 for the joint tree version of the system as applied to the STT transcription of the test data. SU detection by our system is relatively good. IP detection is not as successful, which also impacts edit detection.</Paragraph>
      <Paragraph position="2"> Figure 2 contrasts the results of the joint tree model for STT output with those obtained on reference data with and without fragments. As expected, all error rates are higher on STT output; IPs and llers take the biggest hit.</Paragraph>
      <Paragraph position="3"> Filler performance in particular seems to be affected by recognition errors, which is not surprising, since misrecognized words would likely not be on the target lists of lled pauses and discourse markers. In particular, nearly all missed and incorrectly inserted lled pauses are due to recognition errors. Detection of discourse markers is more challenging; fewer than half the errors on discourse markers are due to recognition errors. Most non-STTrelated ller errors involved the words so and like used as DMs, which are hard problems since the vast majority of the occurrences of these two words are not DMs.</Paragraph>
      <Paragraph position="4"> It is also not surprising that improved IP detection on reference data contributes to a lower error rate for edits.</Paragraph>
      <Paragraph position="5"> As expected, the inclusion of fragments improves performance on IP and edit detection, where fragments frequently occur. In LDC1.3, 17.2% of edit IPs have word fragments occurring before them; 9.9% of edits consist of just a single fragment. In the dev set, 35.5% of edit IPs are associated with fragments. However, fragments are rarely output by the STT system, so for most of our work we chose to use the identical system for processing reference and STT transcripts and did not include fragments. IP detection performance was signi cantly worse for those IPs associated with fragments, as shown in Table 6. However, since fragments are often deleted or recognized as a full word, STT output actually helps with detection of IPs after fragments, apparently because the POS tagger and hidden event LM tend to give unreliable results on the reference transcripts near fragments.</Paragraph>
      <Paragraph position="6"> Figure 3 compares the eval test set performances of the different alternatives for incorporating the hidden event LM posterior, i.e. inclusion in the decision tree, linear interpolation and the joint HMM. For this experiment, the interpolation weighting factor was selected empirically to maximize boundary event prediction accuracy on the STT transcription of the dev set. The results of this comparison are mixed: SU detection is better with the joint tree model, but IP detection and consequently edit  detection are better with the interpolation and HMM approaches. The degradation of SU detection performance with the HMM is counter to ndings in previous work (Stolcke et al., 1998; Shriberg et al., 2000). This may be due to differences in evaluation criteria, given that the HMM approach typically had higher precision which might bene t earlier word-based measures more. In addition, the difference in conclusions may be due to the fact that the decision trees used here include lexical pattern match features in addition to hidden event posteriors.</Paragraph>
      <Paragraph position="7"> A problem in our system is the inability to predict more than one label for a given word or boundary. Words labeled as both ller and edit account for only 0.5% of all llers and edits in the LDC1.3 training data, so it is probably not a signi cant problem. We also do not predict boundaries as both SU and IP. In LDC1.3, these account for 12.8% of SU boundaries, and are treated as simply SU in training. This does not affect IPs for edits, but impacts 38.6% of IPs before llers. By predicting a combined SU-IP boundary in addition to isolated SUs and IPs, we obtain a small reduction in SER for IPs but at the expense of an increase in SU SER. However, separating prediction of IPs after edit regions vs. before llers also yields small improvements in edit region precision and ller recall, resulting in 3.3% and 0.8% relative reduction in ller and edit SERs respectively for the joint HMM.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>