<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1062">
  <Title>Learning to predict pitch accents and prosodic boundaries in Dutch</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Any text-to-speech (TTS) system that aims at producing understandable and natural-sounding output needs to have on-board methods for predicting prosody. Most systems start with generating a prosodic representation at the linguistic or symbolic level, followed by the actual phonetic realization in terms of (primarily) pitch, pauses, and segmental durations. The first step involves placing pitch accents and inserting prosodic boundaries at the right locations (and may involve tune choice as well). Pitch accents correspond roughly to pitch movements that lend emphasis to certain words in an utterance. Prosodic breaks are audible interruptions in the flow of speech, typically realized by a combination of a pause, a boundary-marking pitch movement, and lengthening of the phrase-final segments. Errors at this level may impede the listener in the correct understanding of the spoken utterance (Cutler et al., 1997). Predicting prosody is known to be a hard problem that is thought to require information on syntactic boundaries, syntactic and semantic relations between constituents, discourse-level knowledge, and phonological well-formedness constraints (Hirschberg, 1993). However, producing all this information - using full parsing, including establishing semanto-syntactic relations, and full discourse analysis - is currently infeasible for a real-time system. Resolving this dilemma has been the topic of several studies in pitch accent placement (Hirschberg, 1993; Black, 1995; Pan and McKeown, 1999; Pan and Hirschberg, 2000; Marsi et al., 2002) and in prosodic boundary placement (Wang and Hirschberg, 1997; Taylor and Black, 1998). The commonly adopted solution is to use shallow information sources that approximate full syntactic, semantic and discourse information, such as the words of the text themselves, their part-of-speech tags, or their information content (in general, or in the text at hand), since words with a high (semantic) information content or load tend to receive pitch accents (Ladd, 1996).</Paragraph>
    <Paragraph position="1"> Within this research paradigm, we investigate pitch accent and prosodic boundary placement for Dutch, using an annotated corpus of newspaper text, and machine learning algorithms to produce classifiers for both tasks. We address two questions that  have been left open thus far in previous work: 1. Is there an advantage in inducing decision trees for both tasks, or is it better to not abstract from individual instances and use a memory-based k-nearest neighbour classifier? 2. Is there an advantage in inducing classifiers for  both tasks individually, or can both tasks be learned together.</Paragraph>
    <Paragraph position="2"> The first question deals with a key difference between standard decision tree induction and memory-based classification: how to deal with exceptional instances. Decision trees, CART (Classification and Regression Tree) in particular (Breiman et al., 1984), have been among the first successful machine learning algorithms applied to predicting pitch accents and prosodic boundaries for TTS (Hirschberg, 1993; Wang and Hirschberg, 1997). Decision tree induction finds, through heuristics, a minimallysized decision tree that is estimated to generalize well to unseen data. Its minimality strategy makes the algorithm reluctant to remember individual outlier instances that would take long paths in the tree: typically, these are discarded. This may work well when outliers do not reoccur, but as demonstrated by (Daelemans et al., 1999), exceptions do typically reoccur in language data. Hence, machine learning algorithms that retain a memory trace of individual instances, like memory-based learning algorithms based on the k-nearest neighbour classifier, outperform decision tree or rule inducers precisely for this reason.</Paragraph>
    <Paragraph position="3"> Comparing the performance of machine learning algorithms is not straightforward, and deserves careful methodological consideration. For a fair comparison, both algorithms should be objectively and automatically optimized for the task to be learned.</Paragraph>
    <Paragraph position="4"> This point is made by (Daelemans and Hoste, 2002), who show that, for tasks such as word-sense disambiguation and part-of-speech tagging, tuning algorithms in terms of feature selection and classifier parameters gives rise to significant improvements in performance. In this paper, therefore, we optimize both CART and MBL individually and per task, using a heuristic optimization method called iterative deepening.</Paragraph>
    <Paragraph position="5"> The second issue, that of task combination, stems from the intuition that the two tasks have a lot in common. For instance, (Hirschberg, 1993) reports that knowledge of the location of breaks facilitates accent placement. Although pitch accents and breaks do not consistently occur at the same positions, they are to some extent analogous to phrase chunks and head words in parsing: breaks mark boundaries of intonational phrases, in which typically at least one accent is placed. A learner may thus be able to learn both tasks at the same time.</Paragraph>
    <Paragraph position="6"> Apart from the two issues raised, our work is also practically motivated. Our goal is a good algorithm for real-time TTS. This is reflected in the type of features that we use as input. These can be computed in real-time, and are language independent.</Paragraph>
    <Paragraph position="7"> We intend to show that this approach goes a long way towards generating high-quality prosody, casting doubt on the need for more expensive sentence and discourse analysis.</Paragraph>
    <Paragraph position="8"> The remainder of this paper has the following structure. In Section 2 we define the task, describe the data, and the feature generation process which involves POS tagging, syntactic chunking, and computing several information-theoretic metrics. Furthermore, a brief overview is given of the algorithms we used (CART and MBL). Section 3 describes the experimental procedure (ten-fold iterative deepening) and the evaluation metrics (F-scores). Section 4 reports the results for predicting accents and major prosodic boundaries with both classifiers. It also reports their performance on held-out data and on two fully independent test sets. The final section offers some discussion and concluding remarks.</Paragraph>
    <Paragraph position="9"> 2 Task definition, data, and machine learners To explore the generalization abilities of machine learning algorithms trained on placing pitch accents and breaks in Dutch text, we define three classification tasks: Pitch accent placement - given a word form in its sentential context, decide whether it should be accented. This is a binary classification task.</Paragraph>
    <Paragraph position="10"> Break insertion - given a word form in its sentential context, decide whether it should be followed by a boundary. This is a binary classification task.</Paragraph>
    <Paragraph position="11"> Combined accent placement and break insertion - given a word form in its sentential context, decide whether it should be accented and whether it should be followed by a break. This is a four-class task: no accent and no break; an accent and no break; no accent and a break; an accent and a break.</Paragraph>
    <Paragraph position="12"> Finer-grained classifications could be envisioned, e.g. predicting the type of pitch accent, but we assert that finer classification, apart from being arguably harder to annotate, could be deferred to later processing given an adequate level of precision and recall on the present task.</Paragraph>
    <Paragraph position="13"> In the next subsections we describe which data we selected for annotation and how we annotated it with respect to pitch accents and prosodic breaks. We then describe the implementation of memory-based learning applied to the task.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Prosodic annotation of the data
</SectionTitle>
      <Paragraph position="0"> The data used in our experiments consists of 201 articles from the ILK corpus (a large collection of Dutch newspaper text), totalling 4,493 sentences and 58,097 tokens (excluding punctuation). We set apart 10 articles, containing 2,905 tokens (excluding punctuation) as held-out data for testing purposes.</Paragraph>
      <Paragraph position="1"> As a preprocessing step, the data was tokenised by a rule-based Dutch tokeniser, splitting punctuation from words, and marking sentence endings.</Paragraph>
      <Paragraph position="2"> The articles were then prosodically annotated, without overlap, by four different annotators, and were corrected in a second stage, again without overlap, by two corrector-annotators. The annotators' task was to indicate the locations of accents and/or breaks that they preferred. They used a custom annotation tool which provided feedback in the form of synthesized speech. In total, 23,488 accents were placed, which amounts to roughly one accent in two and a half words. 8627 breaks were marked; 4601 of these were sentence-internal breaks; the remainder consisted of breaks at the end of sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Generating shallow features
</SectionTitle>
      <Paragraph position="0"> The 201 prosodically-annotated articles were subsequently processed through the following 15 feature construction steps, each contributing one feature per word form token. An excerpt of the annotated data with all generated symbolic and numeric1 features is presented in Table 1.</Paragraph>
      <Paragraph position="1"> Word forms (Wrd) - The word form tokens form the central unit to which other features are added.</Paragraph>
      <Paragraph position="2"> Pre- and post-punctuation - All punctuation marks in the data are transferred to two separate features: a pre-punctuation feature (PreP) for punctuation marks such as quotation marks appearing before the token, and a post-punctuation feature (PostP) for punctuation marks such as periods, commas, and question marks following the token.</Paragraph>
      <Paragraph position="3"> Part-of-speech (POS) tagging - We used MBT version 1.0 (Daelemans et al., 1996) to develop a memory-based POS tagger trained on the Eindhoven corpus of written Dutch, which does not overlap with our base data. We split up the full POS tags into two features, the first (PosC) containing the main POS category, the second (PosF) the POS subfeatures. null Diacritical accent - Some tokens bear an orthographical diacritical accent put there by the author to particularly emphasize the token in question. These accents were stripped off the accented letter, and transferred to a binary feature (DiA).</Paragraph>
      <Paragraph position="4"> NP and VP chunking (NpC &amp; VpC) - An approximation of the syntactic structure is provided by simple noun phrase and verb phrase chunkers, which take word and POS information as input and are based on a small number of manually written regular expressions. Phrase boundaries are encoded per word using three tags: 'B' for chunk-initial words, 'I' for chunk-internal words, and 'O' for words outside chunks. The NPs are identified according to the base principle of one semantic head per chunk (nonrecursive, base NPs). VPs include only verbs, not the verbal complements.</Paragraph>
      <Paragraph position="5"> IC - Information content (IC) of a word w is given by IC(w) = [?]log(P(w)), where P(w) is esti1Numeric features were rounded off to two decimal points, where appropriate.</Paragraph>
      <Paragraph position="6"> mated by the observed frequency of w in a large disjoint corpus of about 1.7 GB of unannotated Dutch text garnered from various sources. Word forms not in this corpus were given the highest IC score, i.e.</Paragraph>
      <Paragraph position="7"> the value for hapax legomenae (words that occur once).</Paragraph>
      <Paragraph position="8"> Bigram IC - IC on bigrams (BIC) was calculated for the bigrams (pairs of words) in the data, according to the same formula and corpus material as for unigram IC.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>