<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1062">
<Title>Learning to predict pitch accents and prosodic boundaries in Dutch</Title>
<Section position="4" start_page="0" end_page="0" type="evalu">
<SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Tenfold iterative deepening results </SectionTitle>
<Paragraph position="0"> We first determined two sharp, informed baselines; see Table 2. The informed baseline for accent placement is based on the distinction between content and function words, commonly employed in TTS systems (Taylor and Black, 1998). We refer to this baseline as CF-rule. It is constructed by accenting all content words, while leaving all function words (determiners, prepositions, conjunctions/complementisers, and auxiliaries) unaccented. The required word class information is obtained from the POS tags. The baseline for break placement, henceforth PUNC-rule, relies solely on punctuation: a break is inserted after any sequence of punctuation symbols containing one or more characters from the set {,!?:;()}. It should be noted that both baselines are simple rule-based algorithms that have been manually optimized for the current training set. They perform well above chance level, and pose a serious challenge to any ML approach.</Paragraph>
[Table 2 caption (fragment): "... and combined prediction by means of CART and MBL, for baselines and for average results over 10 folds of the Iterative Deepening experiment; a [?] indicates a significant difference (p < 0.01) between CART and MBL according to a paired t-test. Superscript C refers to the combined task."]
<Paragraph position="2"> From the results displayed in Table 2, the following can be concluded. First, MBL attains the highest F-scores on accent placement, 83.6, and break placement, 88.0. It does so when trained on the ACCENT and BREAK tasks individually. On these tasks, MBL performs significantly better than CART (paired t-tests yield p < 0.01 for both differences).</Paragraph>
<Paragraph position="3"> Second, the performances of MBL and CART on the combined task, when split into F-scores on accent and break placement, are rather close to those on the accent and break tasks. For both MBL and CART, the scores on accent placement as part of the combined task versus accent placement in isolation are not significantly different. For break insertion, however, a small but significant drop in performance can be seen for MBL (p < 0.05) and CART (p < 0.01) when it is performed as part of the COMBINED task.</Paragraph>
<Paragraph position="4"> As is to be expected, the optimal feature selections and classifier settings obtained by iterative deepening turned out to vary over the ten folds for both MBL and CART. Table 3 lists the settings producing the best F-score on accents or breaks. A window of 7 (the current token plus the three preceding and three following word-form tokens) is used by CART and MBL for accent placement, and also by CART for break insertion, whereas MBL uses a window of just 3 for breaks. Both algorithms base their classifications on a minimum of roughly 25 instances (via the stop parameter in CART and k in MBL). Furthermore, MBL uses Gain Ratio feature weighting and Exponential Decay distance weighting. Although no pruning was part of the Iterative Deepening experiment, CART prefers to hold out 5% of its training material to prune the decision tree resulting from the remaining 95%.</Paragraph>
</Section>
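To make the two baselines concrete, they can be sketched in Python as follows. This is a minimal illustration only, not our actual implementation: the token fields ("pos", "text", "is_punct"), the function-word tag names, and the placement of the break marker at the end of the punctuation sequence are assumptions made for the sketch.

    # Hypothetical tag names for the function word classes listed above.
    FUNCTION_WORD_TAGS = {"DET", "PREP", "CONJ", "COMP", "AUX"}
    BREAK_PUNCTUATION = set(",!?:;()")

    def cf_rule(tokens):
        # CF-rule: accent every content word, leave every function word unaccented.
        # Each token is assumed to be a dict with "pos", "text", and "is_punct" fields.
        return [not tok["is_punct"] and tok["pos"] not in FUNCTION_WORD_TAGS
                for tok in tokens]

    def punc_rule(tokens):
        # PUNC-rule: insert a break after any sequence of punctuation symbols that
        # contains at least one character from BREAK_PUNCTUATION.
        breaks = [False] * len(tokens)
        i = 0
        while i < len(tokens):
            if tokens[i]["is_punct"]:
                j = i
                while j < len(tokens) and tokens[j]["is_punct"]:
                    j += 1
                if any(ch in BREAK_PUNCTUATION
                       for tok in tokens[i:j] for ch in tok["text"]):
                    breaks[j - 1] = True   # break placed after the punctuation run
                i = j
            else:
                i += 1
        return breaks

Both rules use only POS and punctuation information, which is what makes them easy to tune by hand on the training set.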
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 External validation </SectionTitle>
<Paragraph position="0"> We tested our optimized approach on our held-out data of 10 articles (2,905 tokens), and on an independent test corpus (van Herwijnen and Terken, 2001).</Paragraph>
<Paragraph position="1"> The latter contains two types of text: 2 newspaper texts (55 sentences, 786 words excluding punctuation) and 17 email messages (70 sentences, 1,133 words excluding punctuation). This material was annotated by 10 experts, who were asked to indicate the preferred accents and breaks. For the purpose of evaluation, words were assumed to be accented if they received an accent from at least 7 of the annotators. Furthermore, of the original four break levels annotated (no break, light, medium, or heavy), only medium and heavy breaks were considered to be a break in our evaluation. Table 4 lists the precision, recall, and F-scores obtained on the two tasks using the single-best scoring setting from the 10-fold CV ID experiment per task. It can be seen that both CART and MBL outperformed the CF-rule baseline on our own held-out data and on the news and email texts, with margins similar to those observed in our 10-fold CV ID experiment. MBL attains an F-score of 86.6 on accents and 91.0 on breaks; both are improvements over the cross-validation estimations. On breaks, however, both CART and MBL failed to improve on the PUNC-rule baseline; on the news and email texts they performed even worse. Inspecting MBL's output on these texts, it turned out that MBL does emulate the PUNC-rule baseline, but that it places additional breaks at positions not marked by punctuation.</Paragraph>
[Table 4 caption (fragment): "... prediction for our held-out corpus and two external corpora of news and email texts, using the best settings for CART and MBL as determined by the ID experiments."]
<Paragraph position="2"> A considerable portion of these non-punctuation breaks is placed incorrectly, or at least differently from what the annotators preferred, resulting in a drop in precision that the gain in recall does not outweigh.</Paragraph>
</Section>
</Section>
</Paper>
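The evaluation protocol of Section 4.2 can be sketched as follows. This is an illustrative sketch only: the 7-of-10 accent threshold and the collapsing of break levels are as described above, but the data structures and the way the annotators' break labels are combined into a single level per word are assumptions.

    def gold_accents(accent_votes, threshold=7):
        # accent_votes: one list of 0/1 accent decisions per annotator, equal lengths.
        # A word counts as accented if at least `threshold` annotators accented it.
        return [sum(votes) >= threshold for votes in zip(*accent_votes)]

    def gold_breaks(break_levels):
        # break_levels: one combined break level per word, from
        # {"none", "light", "medium", "heavy"}; only medium and heavy count as breaks.
        return [level in ("medium", "heavy") for level in break_levels]

    def precision_recall_f(predicted, gold):
        # Standard precision, recall, and F-score over binary decisions,
        # on the 0-100 scale used in Tables 2 and 4.
        tp = sum(1 for p, g in zip(predicted, gold) if p and g)
        fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
        fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return 100 * precision, 100 * recall, 100 * f

With gold-standard decisions built this way, the scores in Table 4 correspond to comparing each classifier's per-word predictions against the outputs of gold_accents and gold_breaks.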