<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1086">
  <Title>Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus
</SectionTitle>
    <Paragraph position="0"> The data for this study were taken from the Switchboard Corpus (Godfrey et al., 1992), which consists of 2430 telephone conversations between adult speakers (approximately 2.4 million words). Participants were both male and female and represented all major dialects of American English. We used a portion of this corpus that was phonetically hand-transcribed (Greenberg et al., 1996) and segmented into speech boundaries at turn boundaries or pauses of more than 500 ms on both sides. Fragments contained seven words on average. Additionally, each word was coded for probabilistic and contextual information, such as word frequency, conditional probabilities, the rate of speech, and the canonical pronunciation (Fosler-Lussier and Morgan, 1999).</Paragraph>
    <Paragraph position="1"> The dataset used in all analysis in this study consists of only the first hour of the database, comprised of 1,824 utterances with 13,190 words. These utterances were hand coded for pitch accent and intonational phrase brakes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Pitch Accent Coding
</SectionTitle>
      <Paragraph position="0"> The utterances were hand labeled for accents and boundaries according to the Tilt Intonational Model (Taylor, 2000). This model is characterized by a series of intonational events: accents and boundaries. Labelers were instructed to use duration, amplitude, pausing information, and changes in f0 to identify events. In general, labelers followed the basic conventions of EToBI for coding (Taylor, 2000).</Paragraph>
      <Paragraph position="1"> However, the Tilt coding scheme was simplified.</Paragraph>
      <Paragraph position="2"> Accents were coded as either major or minor (and some rare level accents) and breaks were either rising or falling. Agreement for the Tilt coding was reported at 86%. The CU coding also used a simplified EToBI coding scheme, with accent types conflated and only major breaks coded. Accent and break coding pair-wise agreement was between 8595% between coders, with a kappa of 71%-74% where is the difference between expected agreement and actual agreement.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Variables
</SectionTitle>
    <Paragraph position="0"> The label we were predicting was a binary distinction of accented or not. The variables we used for prediction fall into three main categories: syntactic, probabilistic variables, which include word frequency and collocation measures, and phonological variables, which capture aspects of rhythm and timing that affect accentuation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Syntactic variables
</SectionTitle>
      <Paragraph position="0"> The only syntactic category we used was a four-way classification for hand-generated part of speech (POS): Function, Noun, Verb, Other, where Other includes all adjectives and adverbs1. Table 1 gives the percentage of accented and unaccented items by</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Probabilistic variables
</SectionTitle>
      <Paragraph position="0"> Following a line of research that incorporates the information content of a word as well as collocation measures (Pan and McKeown, 1999; Pan and Hirschberg, 2001) we have included a number of probabilistic variables. The probabilistic variables we used were the unigram frequency, the predictability of a word given the preceding word (bigram), the predictability of a word given the following word (reverse bigram), the joint probability of a word with the preceding (joint), and the joint probability of a word with the following word (reverse joint). Table 2 provides the definition for these, as well as high probability examples from the corpus (the emphasized word being the current target).</Paragraph>
      <Paragraph position="1"> Note all probabilistic variables were in log scale.</Paragraph>
      <Paragraph position="2"> The values for these probabilities were obtained using the entire 2.4 million words of SWBD2. Table</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 presents the Spearman's rank correlation coeffi-
</SectionTitle>
    <Paragraph position="0"> cient between the probabilistic measures and accent (Conover, 1980). These values indicate the strong correlation of accents to the probabilistic variables.</Paragraph>
    <Paragraph position="1"> As the probability increases, the chance of an accent decreases. Note that all values are significant at the p&lt;:001 level.</Paragraph>
    <Paragraph position="2"> We also created a combined part of speech and unigram frequency variable in order to have a variable that corresponds to the variable used in (Pan 2Our current implementation of CRF only takes categorical variables, thus for the experiments, all probabilistic variables were binned into 5 equal categories. We also tried more bins and produced similar results, so we only report on the 5-binned categories. We computed correlations between pitch accent and the original 5 variables as well as the binned variables and they are very similar.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Phonological variables
</SectionTitle>
      <Paragraph position="0"> The last category of predictors, phonological variables, concern aspects of rhythm and timing of an utterance. We have two main sources for these variables: those that can be computed solely from a string of text (textual), and those that require some sort of acoustic information (acoustic). Sun (2002) demonstrated that the number of phones in a syllable, the number of syllables in a word, and the position of a word in a sentence are useful predictors of which syllables get accented. While Sun was concerned with predicting accented syllables, some of the same variables apply to word level targets as well. For our textual phonological features, we included the number of syllables in a word and the number of phones (both in citation form as well as transcribed form). Instead of position in a sentence, we used the position of the word in an utterance since the fragments do not necessarily correspond to sentences in the database we used. We also made use of the utterance length. Below is the list of our textual features: Number of canonical syllables Number of canonical phones Number of transcribed phones The length of the utterance in number of words The position of the word in the utterance The main purpose of this study is to better predict which words in a string of text receive accent. So far, all of our predictors are ones easily computed from a string of text. However, we have included a few variables that affect the likelihood of a word being accented that require some acoustic data. To the best of our knowledge, these features have not been used in acoustic models of pitch accent prediction. These features include the duration of the word, speech rate, and following intonational phrase boundaries. Given the nature of the SWBD corpus, there are many disfluencies. Thus, we also  pitch accent prediction.</Paragraph>
      <Paragraph position="1"> included following pauses and filled pauses as predictors. Below is the list of our acoustic features: Log of duration in milliseconds normalized by number of canonical phones binned into 5 equal categories.</Paragraph>
      <Paragraph position="2"> Log Speech Rate; calculated on strings of speech bounded on either side by pauses of 300 ms or greater and binned into 5 equal categories. null Following pause; a binary distinction of whether a word is followed by a period of silence or not.</Paragraph>
      <Paragraph position="3"> Following filled pause; a binary distinction of whether a word was followed by a filled pause (uh, um) or not.</Paragraph>
      <Paragraph position="4"> Following IP boundary Table 4 indicates that each of these features significantly affect the presence of pitch accent. While certainly all of these variables are not independent of on another, using CRFs, one can incorporate all of these variables into the pitch accent prediction model with the advantage of making use of the dependencies among the labels.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Surrounding Information
</SectionTitle>
      <Paragraph position="0"> Sun (2002) has shown that the values immediately preceding and following the target are good predictors for the value of the target. We also experimented with the effects of the surrounding values by varying the window size of the observation-label feature extraction described in Section 2. When the window size is 1, only values of the word that is labelled are incorporated in the model. When the window size is 3, the values of the previous and the following words as well as the current word are incorporated in the model. Window size 5 captures the values of the current word, the two previous words and the two following words.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> All experiments were run using 10 fold crossvalidation. We used Viterbi decoding to find the most likely sequence and report the performance in terms of label accuracy. We ran all experiments with varying window sizes (w 2 f1;3;5g). The baseline which simply assigns the most common label, unaccented, achieves 60:53 1:50%.</Paragraph>
    <Paragraph position="1"> Previous research has demonstrated that part of speech and frequency, or a combination of these two, are very reliable predictors of pitch accent.</Paragraph>
    <Paragraph position="2"> Thus, to test the worthiness of using a CRF model, the first experiment we ran was a comparison of an HMM to a CRF using just the combination of part of speech and unigram. The HMM score (referred as HMM:POS, Unigram in Table 5) was 68:62 1:78, while the CRF model (referred as CRF:POS, Uni-gram in Table 5) performed significantly better at 72:56 1:86. Note that Pan and McKeown (1999) reported 74% accuracy with their HMM model.</Paragraph>
    <Paragraph position="3"> The difference is due to the different corpora used in each case. While they also used spontaneous speech, it was a limited domain in the sense that it was speech from discharge orders from doctors at one medical facility. The SWDB corpus is open domain conversational speech.</Paragraph>
    <Paragraph position="4"> In order to capture some aspects of the IC and collocational strength of a word, in the second experiment we ran part of speech plus all of the probabilistic variables (referred as CRF:POS, Prob in Table 5). The model accuracy was 73.94%, thus improved over the model using POS and unigram values by 1.38%.</Paragraph>
    <Paragraph position="5"> In the third experiment we wanted to know if TTS applications that made use of purely textual input could be aided by the addition of timing and rhythm variables that can be gleaned from a text string.</Paragraph>
    <Paragraph position="6"> Thus, we included the textual features described in Section 4.3 in addition to the probabilistic and syntactic features (referred as CRF:POS, Prob, Txt in Table 5). The accuracy was improved by 1.73%.</Paragraph>
    <Paragraph position="7"> For the final experiment, we added the acoustic variable, resulting in the use of all the variables described in Section 4 (referred as CRF:All in Table 5). We get about 0.5% increase in accuracy, 76.1% with a window of size w = 1.</Paragraph>
    <Paragraph position="8"> Using larger windows resulted in minor increases in the performance of the model, as summarized in</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Pitch accent prediction is a difficult task, in that, the number of different speakers, topics, utterance fragments and disfluent production of the SWBD corpus only increase this difficulty. The fact that 21% of the function words are accented indicates that models of pitch accent that mostly rely on part of speech and unigram frequency would not fair well with this corpus. We have presented a model of pitch accent that captures some of the other factors that influence accentuation. In addition to adding more probabilistic variables and phonological factors, we have used a sequence model that captures the interdependence of accents within a phrase.</Paragraph>
    <Paragraph position="1"> Given the distinct natures of corpora used, it is difficult to compare these results with earlier models. However, in experiment 1 (HMM: POS, Uni-gram vs CRF: POS, Unigram) we have shown that a CRF model achieves a better performance than an HMM model using the same features. However, the real strength of CRFs comes from their ability to incorporate different sources of information efficiently, as is demonstrated in our experiments.</Paragraph>
    <Paragraph position="2"> We did not test directly the probabilistic measures (or collocation measures) that have been used before for this task, namely information content (IC) (Pan and McKeown, 1999) and mutual information (Pan and Hirschberg, 2001). However, the measures we have used encompass similar information. For example, IC is only the additive inverse of our unigram measure:</Paragraph>
    <Paragraph position="4"> Rather than using mutual information as a measure of collocational strength, we used unigram, bigram and joint probabilities. A model that includes both joint probability and the unigram probabilities of wi and wi 1 is comparable to one that includes mutual information.</Paragraph>
    <Paragraph position="5"> Just as the likelihood of a word being accented is influenced by a following silence or IP boundary, the collocational strength of the target word with the following word (captured by reverse bi-gram and reverse joint) is also a factor. With the use of POS, unigram, and all bigram and joint probabilities, we have shown that (a) CRFs outperform HMMs, and (b) our probabilistic variables increase accuracy from a model that include POS + unigram (73.94% compared to 72.56%).</Paragraph>
    <Paragraph position="6"> For tasks in which pitch accent is predicted solely based on a string of text, without the addition of acoustic data, we have shown that adding aspects of rhythm and timing aids in the identification of accent targets. We used the number of words in an utterance, where in the utterance a word falls, how long in both number of syllables and number of phones all affect accentuation. The addition of these variables improved the model by nearly 2%.</Paragraph>
    <Paragraph position="7"> These results suggest that Accent prediction models that only make use of textual information could be improved with the addition of these variables.</Paragraph>
    <Paragraph position="8"> While not trying to provide a complete model of accentuation from acoustic information, in this study we tested a few acoustic variables that have not yet been tested. The nature of the SWBD corpus allowed us to investigate the role of disfluencies and widely variable durations and speech rate on accentuation. Especially speech rate, duration and surrounding silence are good predictors of pitch accent. The addition of these predictors only slightly improved the model (about .5%). Acoustic features are very sensitive to individual speakers. In the corpus, there are many different speakers of varying ages and dialects. These variables might become more useful if one controls for individual speaker differences. To really test the usefulness of these variables, one would have to combine them with acoustic features that have been demonstrated to be good predictors of pitch accent (Sun, 2002; Conkie et al., 1999; Wightman et al., 2000).</Paragraph>
  </Section>
class="xml-element"></Paper>