<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1029">
  <Title>Unsupervised and Semi-supervised Learning of Tone and Pitch Accent</Title>
  <Section position="3" start_page="224" end_page="226" type="metho">
    <SectionTitle>
2 Data Sets
</SectionTitle>
    <Paragraph position="0"> We consider two corpora: one in English for pitch accent recognition and two in Mandarin for tone recognition. We introduce each briefly below.</Paragraph>
    <Section position="1" start_page="224" end_page="224" type="sub_section">
      <SectionTitle>
2.1 English Corpus
</SectionTitle>
      <Paragraph position="0"> We employ a subset of the Boston Radio News Corpus (Ostendorf et al., 1995), read by female speaker F2B, comprising 40 minutes of news material. The corpus includes pitch accent, phrase and boundary tone annotation in the ToBI framework (Silverman et al., 1992) aligned with manual transcription and syllabification of the materials. Following earlier research (Ostendorf and Ross, 1997; Sun, 2002), we collapse the ToBI pitch accent labels to four classes: unaccented, high, low, and downstepped high for experimentation. null</Paragraph>
    </Section>
    <Section position="2" start_page="224" end_page="225" type="sub_section">
      <SectionTitle>
2.2 Mandarin Chinese Tone Data
</SectionTitle>
      <Paragraph position="0"> Mandarin Chinese is a language with lexical tone in which each syllable carries a tone and the meaning of the syllable is jointly determined by the tone and segmental information. Mandarin Chinese has four canonical lexical tones, typically described as follows: 1) high level, 2) mid-rising, 3) low fallingrising, and 4) high falling.1 The canonical pitch con- null tours for these tones appear in Figure 1.</Paragraph>
      <Paragraph position="1"> We employ data from two distinct sources in the experiments reported here.</Paragraph>
      <Paragraph position="2">  The first data set is very clean speech data drawn from a collection of read speech collected under laboratory conditions by (Xu, 1999). In these materials, speakers read a set of short sentences where syllable tone and position of focus were varied to assess the effects of focus position on tone realization. Focus here corresponds to narrow focus, where speakers were asked to emphasize a particular word or syllable. Tones on focussed syllables were found to conform closely to the canonical shapes described above, and in previous supervised experiments using a linear support vector machine classifier trained on focused syllables, accuracy approached 99%. For these materials, pitch tracks were manually aligned to the syllable and automatically smoothed and timenormalized by the original researcher, resulting in 20 pitch values for each syllable.</Paragraph>
      <Paragraph position="3">  The second data set is drawn from the Voice of America Mandarin broadcast news, distributed by the Linguistic Data Consortium2, as part of the Topic Detection and Tracking (TDT-2) evaluation. Using the corresponding anchor scripts, automatically word-segmented, as gold standard transcription, audio from the news stories was force-aligned to the text transcripts. The forced alignment employed the language porting functionality of the University of speech data described below contains no such instances.  2001). A mapping from the transcriptions to English phone sequences supported by Sonic was created using a Chinese character-pinyin pronunciation dictionary and a manually constructed mapping from pinyin sequences to the closest corresponding English phone sequences.3</Paragraph>
    </Section>
    <Section position="3" start_page="225" end_page="226" type="sub_section">
      <SectionTitle>
2.3 Acoustic Features
</SectionTitle>
      <Paragraph position="0"> Using Praat's (Boersma, 2001) &amp;quot;To pitch&amp;quot; and &amp;quot;To intensity&amp;quot; functions and the alignments generated above, we extract acoustic features for the prosodic region of interest. This region corresponds to the &amp;quot;final&amp;quot; region of each syllable in Chinese, including the vowel and any following nasal, and to the syllable nucleus in English.4 For all pitch and intensity features in both datasets, we compute per-speaker z-score normalized log-scaled values. We extract pitch values from points across valid pitch tracked regions in the syllable. We also compute mean pitch across the syllable. Recent phonetic research (Xu, 1997; Shih and Kochanski, 2000) has identified significant effects of carryover coarticulation from preceding adjacent syllable tones. To minimize these effects consistent with the pitch target approximation model (Xu et al., 1999), we compute slope features based on the second half of this final region, where this model predicts that the underlying pitch height and slope targets of the syllable will be most accurately approached. We further log-scale and normalize slope values to compensate for greater speeds of pitch fall than pitch rise(Xu and Sun, 2002).</Paragraph>
      <Paragraph position="1"> We consider two types of contextualized features as well, to model and compensate for coarticulatory effects from neighboring syllables. The first set of features, referred to as &amp;quot;extended features&amp;quot;, includes the maximum and mean pitch from adjacent syllables as well as the nearest pitch point or points from the preceding and following syllables. These features extend the modeled tone beyond the strict bounds of the syllable segmentation. A second set of contextual features, termed &amp;quot;difference features&amp;quot;, captures the change in pitch maximum, mean, midpoint, and slope as well as intensity maximum be- null tween the current syllable and the previous or following syllable.</Paragraph>
      <Paragraph position="2"> In prior supervised experiments using support vector machines(Levow, 2005), variants of this representation achieved competitive recognition levels for both tone and pitch accent recognition. Since many of the experiments for Mandarin Chinese tone recognition deal with clean, careful lab speech, we anticipate little coarticulatory influence, and use a simple pitch-only context-free representation for our primary Mandarin tone recognition experiments.</Paragraph>
      <Paragraph position="3"> For primary experiments in pitch accent recognition, we employ a high-performing contextualized representation in (Levow, 2005), using both &amp;quot;extended&amp;quot; and &amp;quot;difference&amp;quot; features computed only on the preceding syllable. We will also report some contrastive experimental results varying the amount of contextual information.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="226" end_page="226" type="metho">
    <SectionTitle>
3 Unsupervised and Semi-supervised
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="226" end_page="226" type="sub_section">
      <SectionTitle>
Learning
</SectionTitle>
      <Paragraph position="0"> The bottleneck of time and monetary cost associated with manual annotation has generated significant interest in the development of techniques for machine learning and classification that reduce the amount of annotated data required for training. Likewise, learning from unlabeled data aligns with the perspective of language acquisition, as child learners must identify these linguistic categories without explicit instruction by observation of natural language interaction. Of particular interest are techniques in unsupervised and semi-supervised learning where the structure of unlabeled examples may be exploited. Here we consider both unsupervised techniques with no labeled training data and semi-supervised approaches where unlabeled training data is used in conjunction with small amounts of labeled data.</Paragraph>
      <Paragraph position="1"> A wide variety of unsupervised clustering techniques have been proposed. In addition to classic clustering techniques such as k-means, recent work has shown good results for many forms of spectral clustering including those by (Shi and Malik, 2000; Belkin and Niyogi, 2002; Fischer and Poland, 2004). In the unsupervised experiments reported here, we employ asymmetric k-lines clustering by (Fischer and Poland, 2004) using code available at the authors' site, as our primary unsupervised learning approach. Asymmetric clustering is distinguished from other techniques by the construction and use of context-dependent kernel radii. Rather than assuming that all clusters are uniform and spherical, this approach enhances clustering effectiveness when clusters may not be spherical and may vary in size and shape. We will see that this flexibility yields a good match to the structure of Mandarin tone data where both shape and size of clusters vary across tones. In additional contrastive experiments reported below, we also compare k-means clustering, symmetric k-lines clustering (Fischer and Poland, 2004), and Laplacian Eigenmaps (Belkin and Niyogi, 2002) with k-lines clustering.</Paragraph>
      <Paragraph position="2"> The spectral techniques all perform spectral decomposition on some representation of the affinity or adjacency graph.</Paragraph>
      <Paragraph position="3"> For semi-supervised learning, we employ learners in the Manifold Regularization framework developed by (Belkin et al., 2004). This work postulates an underlying intrinsic distribution on a low dimensional manifold for data with an observed, ambient distribution that may be in a higher dimensional space. It further aims to preserve locality in that elements that are neighbors in the ambient space should remain &amp;quot;close&amp;quot; in the intrinsic space. A semi-supervised classification algorithm, termed &amp;quot;Laplacian Support Vector Machines&amp;quot;, allows training and classification based on both labeled and unlabeled training examples.</Paragraph>
      <Paragraph position="4"> We contrast results under both unsupervised and semi-supervised learning with most common class assignment and previous results employing fully supervised approaches, such as SVMs.</Paragraph>
  </Section>
  <Section position="5" start_page="226" end_page="228" type="metho">
    <SectionTitle>
4 Unsupervised Clustering Experiments
</SectionTitle>
    <Paragraph position="0"> We executed four sets of experiments in unsupervised clustering using the (Fischer and Poland, 2004) asymmetric clustering algorithm.</Paragraph>
    <Section position="1" start_page="226" end_page="227" type="sub_section">
      <SectionTitle>
4.1 Experiment Configuration
</SectionTitle>
      <Paragraph position="0"> In these experiments, we chose increasingly difficult and natural test materials. In the first experiment with the cleanest data, we used only focused syllables from the read Mandarin speech dataset.</Paragraph>
      <Paragraph position="1"> In the second, we included both in-focus (focused)  and pre-focus syllables from the read Mandarin speech dataset.5 In the third and fourth experiments, we chose subsets of broadcast news report data, from the Voice of America (VOA) in Mandarin and Boston University Radio News corpus in English.</Paragraph>
      <Paragraph position="2"> In all experiments on Mandarin data, we performed clustering on a balanced sampling set of tones, with 100 instances from each class6, yielding a baseline for assignment of a single class to all instances of 25%. We then employed a two-stage repeated clustering process, creating 2 or 3 clusters at each stage.</Paragraph>
      <Paragraph position="3"> For experiments on English data, we extracted a set of 1000 instances, sampling pitch accent types according to their frequency in the collection. We performed a single clustering phase with 2 to 16 clusters, reporting results at different numbers of clusters.</Paragraph>
      <Paragraph position="4"> For evaluation, we report accuracy based on assigning the most frequent class label in each cluster to all members of the cluster.</Paragraph>
    </Section>
    <Section position="2" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
4.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> We find that in all cases, accuracy based on the asymmetric clustering is significantly better than most common class assignment and in some cases approaches labelled classification accuracy. Unsurprisingly, the best results, in absolute terms, are achieved on the clean focused syllables, reaching 87% accuracy. For combined in-focus and pre-focus syllables, this rate drops to 77%. These rates contrast with 99-93% accuracies in supervised classification using linear SVM classifiers with several thousand labelled training examples(Surendran et al., 2005).</Paragraph>
      <Paragraph position="1"> On broadcast news audio, accuracy for Mandarin reaches 57%, still much better than the 25% level, though below a 72% accuracy achieved using supervised linear SVMs with 600 labeled training examples. Interestingly, for English pitch accent recognition, accuracy reaches 78.4%, aproaching the 80.1% 5Post-focus syllables typically have decreased pitch height and range, resulting in particularly poor recognition accuracy. We chose not to concentrate on this specific tone modeling problem here.</Paragraph>
      <Paragraph position="2"> 6Sample sizes were bounded to support rapid repeated experimentation and for consistency with the relatively small VOA data set.</Paragraph>
      <Paragraph position="3">  learners across numbers of clusters.</Paragraph>
      <Paragraph position="4"> accuracy achieved with SVMs on a comparable data representation.</Paragraph>
    </Section>
    <Section position="3" start_page="227" end_page="228" type="sub_section">
      <SectionTitle>
4.3 Contrastive Experiments
</SectionTitle>
      <Paragraph position="0"> We further contrast the use of different unsupervised learners, comparing the three spectral techniques and k-means with Euclidean distance. All contrasts are presented for English pitch accent classification, ranging over different numbers of clusters, with the best parameter setting of neighborhood size. The results are illustrated in Figure 2. K-means and the asymmetric clustering technique are presented for the clean focal Mandarin speech under the standard two stage clustering, in Table 1.</Paragraph>
      <Paragraph position="1"> The asymmetric k-lines clustering approach consistently outperforms the corresponding symmetric clustering learner, as well as Laplacian Eigenmaps with binary weights for pitch accent classification.</Paragraph>
      <Paragraph position="2"> Somewhat surprisingly, k-means clustering outperforms all of the other approaches when producing 314 clusters. Accuracy for the optimal choice of clusters and parameters is comparable for asymmetric k-lines clustering and k-means, and somewhat better than all other techniques considered. The careful feature selection process for tone and pitch accent modeling may reduce the difference between the spectral and k-means approaches. In contrast, for the four tone classification task in Mandarin using two stage clustering with 2 or 3 initial clusters, the best clustering using asymmetric k-lines strongly outperforms k-means.</Paragraph>
      <Paragraph position="3"> We also performed a contrastive experiment in pitch accent recognition in which we excluded contextual information from both types of contextual features. We find little difference for the majority of  Open Diamond: High tone (1), Filled black traingle: Rising tone (2), Filled grey square: Low tone (3), X: Falling tone (4) the unsupervised clustering algorithms, with results from symmetric, asymmetric and k-means clustering differing by less than 1% in absolute accuracy. It is, however, worth noting that exclusion of these features from experiments using supervised learning led to a 4% absolute reduction in accuracy.</Paragraph>
    </Section>
    <Section position="4" start_page="228" end_page="228" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> An examination of both the clusters formed and the structure of the data provides insight into the effectiveness of this process. Figure 3 displays 2 dimensions of the Mandarin four-tone data from the focused read speech, where normalized pitch mean is on the x-axis and slope is on the y-axis. The separation of classes and their structure is clear. One observes that rising tone (tone 2) lies above the x-axis, while high-level (tone 1) lies along the x-axis. Low (tone 3) and falling (tone 4) tones lie mostly below the x-axis as they generally have falling slope. Low tone (3) appears to the left of falling tone (4) in the figure, corresponding to differences in mean pitch.</Paragraph>
      <Paragraph position="1"> In clustering experiments, an initial 2- or 3-way split separates falling from rising or level tones based on pitch slope. The second stage of clustering splits either by slope (tones 1,2, some 3) or by pitch height (tones 3,4). These clusters capture the natural structure of the data where tones are characterized by pitch height and slope targets.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="228" end_page="229" type="metho">
    <SectionTitle>
5 Semi-supervised Learning
</SectionTitle>
    <Paragraph position="0"> By exploiting a semi-supervised approach, we hope to enhance classification accuracy over that achievable by unsupervised methods alone by incorporating small amounts of labeled data while exploiting the structure of the unlabeled examples.</Paragraph>
    <Section position="1" start_page="228" end_page="228" type="sub_section">
      <SectionTitle>
5.1 Experiment Configuration
</SectionTitle>
      <Paragraph position="0"> We again conduct contrastive experiments using both the clean focused read speech and the more challenging broadcast news data. In each Mandarin case, for each class, we use only a small set (40) of labeled training instances in conjunction with an additional sixty unlabeled instances, testing on 40 instances. For English pitch accent, we restricted the task to the binary classification of syllables as accented or unaccented. For the one thousand samples we proportionally labeled 200 unaccented examples and 100 accented examples. 7 We configure the Laplacian SVM classification with binary neighborhood weights, radial basis function kernel, and cosine distance measure typically with 6 nearest neighbors. Following (C-C.Cheng and Lin, 2001), for a0 -class classification we train a1a3a2a4a1a3a5a7a6a9a8 a10 binary classifiers. We then classify each test instance using all of the classifiers and assign the most frequent prediction, with ties broken randomly. We contrast these results both with conventional SVM classification with a radial basis function kernel excluding the unlabeled training examples and with most common class assignment, which gives a 25% baseline.</Paragraph>
    </Section>
    <Section position="2" start_page="228" end_page="229" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> For the Mandarin focused read syllables, we achieve  For the noisier broadcast news data, the accuracy is 70% for the comparable task. These results all substantially outperform the 25% most common class assignment level. The semi-supervised classifier also reliably outperforms an SVM classifier with an RBF kernel trained on the same labeled training instances. This baseline SVM classifier with a very small training set achieves 81% accuracy on clean read speech, but only a0 35% on the broadcast news speech. Finally, for English pitch accent recognition in broadcast news data, the classifier achieves 81.5%, relative to 84% accuracy in the fully supervised case.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>